medcat.utils.legacy.convert_vocab

Classes

Vocab

Vocabulary used to store word embeddings for context similarity calculation.

Functions

get_vocab_from_old(old_path)

Convert a v1 vocab file to a v2 Vocab.

Module Contents

class medcat.utils.legacy.convert_vocab.Vocab

Bases: medcat.storage.serialisables.AbstractSerialisable

Vocabulary used to store word embeddings for context similarity calculation. Also used by the spell checker, but only to check whether a word is spelled correctly, not to fix the spelling.

Properties:

vocab (dict[str, WordDescriptor]):

Map from word to attributes, e.g. {'house': {'vector': <np.array>, 'count': <int>, …}, …}

index2word (dict[int, str]):

Map from index to word, used for negative sampling

vec_index2word (dict[int, str]):

Same as index2word, but only for words that have vectors

__init__()

Return type: None

vocab: dict[str, WordDescriptor]

index2word: dict[int, str]

vec_index2word: dict[int, str]

cum_probs: numpy.ndarray

inc_or_add(word, cnt=1, vec=None)

Add a word or increase its count.

Parameters:

word (str) – Word to be added
cnt (int) – Amount by which to increase the count, or the initial count if the word is new. (Default value = 1)
vec (Optional[np.ndarray]) – Word vector (Default value = None)

Return type:

None
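The add-or-increment behaviour described above can be sketched with a plain dictionary. This is only an illustration, not the library's implementation: the module-level `vocab` dict and its `"count"`/`"vector"` keys are hypothetical stand-ins for the real storage, and the sketch assumes that a vector passed for an already known word is ignored.

```python
from typing import Optional

# Hypothetical stand-in for the real storage: word -> {"count", "vector"}.
vocab: dict[str, dict] = {}


def inc_or_add(word: str, cnt: int = 1, vec: Optional[list[float]] = None) -> None:
    """Add `word` with count `cnt`, or add `cnt` to the count of a known word."""
    if word in vocab:
        # Known word: only the count changes (assumption: `vec` is ignored here).
        vocab[word]["count"] += cnt
    else:
        # New word: `cnt` becomes its initial count.
        vocab[word] = {"count": cnt, "vector": vec}


inc_or_add("house", cnt=3)  # new word, count starts at 3
inc_or_add("house")         # known word, count grows by the default of 1
```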

remove_all_vectors()

Remove all stored vector representations.

Return type: None

remove_words_below_cnt(cnt)

Remove all words with frequency below cnt.

Parameters: cnt (int) – Word count limit.
Return type: None
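As a rough illustration of count-based pruning, the sketch below filters a plain word-to-count dictionary and renumbers the surviving words. It is not the library's code: the free function and the `counts` dictionary are hypothetical, and the sketch assumes that the index maps must be rebuilt after words are removed.

```python
def remove_words_below_cnt(
    counts: dict[str, int], cnt: int
) -> tuple[dict[str, int], dict[int, str]]:
    """Return the words with count >= cnt and a freshly numbered index map."""
    # Words whose count is strictly below the limit are dropped.
    kept = {word: c for word, c in counts.items() if c >= cnt}
    # Renumber from 0 so the indices stay contiguous after the removal.
    index2word = dict(enumerate(kept))
    return kept, index2word


counts = {"the": 120, "house": 7, "xqz": 1}
kept, index2word = remove_words_below_cnt(counts, cnt=5)
```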

_rebuild_index()

inc_wc(word, cnt=1)

Increase word count by cnt.

Parameters:

word (str) – For which word to increase the count
cnt (int) – By how much to increase the count (Default value = 1)

Return type:

None

add_vec(word, vec)

Add vector to a word.

Parameters:

word (str) – To which word to add the vector.
vec (np.ndarray) – The vector to add.

Return type:

None

reset_counts(cnt=1)

Reset the count for all words to cnt.

Parameters: cnt (int) – New count for all words in the vocab. (Default value = 1)
Return type: None

update_counts(tokens)

Given a list of tokens, update the counts for words in the vocab.

Parameters: tokens (list[str]) – Usually a large block of text split into tokens/words.
Return type: None
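A minimal sketch of updating counts from a token list, using a plain dictionary rather than the real class. The free function and the `counts` dictionary are hypothetical, and the sketch assumes that tokens not already in the vocabulary are skipped rather than added.

```python
from collections import Counter


def update_counts(counts: dict[str, int], tokens: list[str]) -> None:
    """Increase the count of every known word by its number of occurrences."""
    for word, n in Counter(tokens).items():
        # Assumption: unknown tokens are ignored instead of being added.
        if word in counts:
            counts[word] += n


counts = {"house": 1, "garden": 2}
update_counts(counts, "the house and the other house".split())
```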

add_word(word, cnt=1, vec=None, replace=True)

Add a word to the vocabulary.

Parameters:

word (str) – The word to be added. It should be lemmatized and lowercased
cnt (int) – Count of this word in your dataset (Default value = 1)
vec (Optional[np.ndarray]) – The vector representation of the word (Default value = None)
replace (bool) – Whether to replace an existing vector representation (Default value = True)

Return type:

None

add_words(path, replace=True)

Add words to the vocab from a file. Each line of the file must have the following format (the vector is optional):

<word> <cnt>[ <vec_space_separated>]

For example, a line for the word house with a 3-dimensional vector: house 34444 0.3232 0.123213 1.231231

Parameters:

path (str) – path to the file with words and vectors
replace (bool) – Whether existing words in the vocabulary will be replaced. Defaults to True.

Return type:

None
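To make the line format concrete, here is a small parser for a single line. It is a sketch, not the library's reader: `parse_vocab_line` is a hypothetical helper, the fields are assumed to be separated by arbitrary whitespace, and the vector is returned as a list of floats rather than a NumPy array.

```python
from typing import Optional


def parse_vocab_line(line: str) -> tuple[str, int, Optional[list[float]]]:
    """Split '<word> <cnt>[ <vec_space_separated>]' into its three parts."""
    parts = line.split()  # assumption: any run of whitespace separates fields
    word, cnt = parts[0], int(parts[1])
    # Everything after the count, if present, is the vector.
    vec = [float(x) for x in parts[2:]] if len(parts) > 2 else None
    return word, cnt, vec


with_vec = parse_vocab_line("house 34444 0.3232 0.123213 1.231231")
without_vec = parse_vocab_line("house 34444\n")
```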

init_cumsums()

Initialise cumulative sums.

This replaces the unigram table. Like the unigram table, it allows generating a list of indices whose distribution matches the one implied by the word counts.

Return type: None
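The idea behind cumulative sums can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's implementation: probabilities are taken directly from raw counts (any smoothing the library may apply is omitted), and `counts`, `cum_probs` and `sample_indices` are names chosen for this example.

```python
import bisect
import itertools
import random

counts = {"the": 50, "house": 30, "garden": 20}
index2word = dict(enumerate(counts))  # 0 -> "the", 1 -> "house", 2 -> "garden"

# Running totals of the per-word probabilities; the last entry is about 1.0.
total = sum(counts.values())
cum_probs = list(
    itertools.accumulate(counts[word] / total for word in index2word.values())
)


def sample_indices(n: int, rng: random.Random) -> list[int]:
    """Draw n indices, each with probability proportional to its word count."""
    last = len(cum_probs) - 1
    # A uniform draw in [0, 1) falls into the interval owned by one index;
    # the clamp guards against rounding in the final cumulative value.
    return [min(bisect.bisect_right(cum_probs, rng.random()), last) for _ in range(n)]


samples = sample_indices(1000, random.Random(0))
```

Compared with a unigram table, which repeats each index in proportion to its count, the cumulative array needs only one entry per word, and each draw costs a binary search.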

get_negative_samples(n=6, ignore_punct_and_num=False)

Get N negative samples.

Parameters:

n (int) – How many words to return (Default value = 6)
ignore_punct_and_num (bool) – Whether to ignore punctuation and numbers. Defaults to False.

Raises:

Exception – If no unigram table is present.

Returns:

list[int] – Indices for words in this vocabulary.

Return type:

list[int]

get_vectors(indices)

Parameters: indices (list[int])
Return type: list[numpy.ndarray]

__getitem__(word)

Parameters: word (str)
Return type: int

vec(word)

Parameters: word (str)
Return type: Optional[numpy.ndarray]

count(word)

Parameters: word (str)
Return type: int

item(word)

Parameters: word (str)
Return type: WordDescriptor

__contains__(word)

Parameters: word (str)
Return type: bool

__eq__(other)

Parameters: other (Any)
Return type: bool

save(save_path, serialiser=AvailableSerialisers.dill, overwrite=False)

Save Vocab at path.

Parameters:

save_path (str) – The path to save at.
serialiser (Union[str, AvailableSerialisers], optional) – The serialiser. Defaults to AvailableSerialisers.dill.
overwrite (bool, optional) – Whether to allow overwriting existing files. Defaults to False.

Return type:

None

classmethod load(path)

Parameters: path (str)
Return type: Vocab

get_strategy()

Return type: SerialisingStrategy

classmethod get_init_attrs()

Return type: list[str]

classmethod ignore_attrs()

Return type: list[str]

classmethod include_properties()

Return type: list[str]

medcat.utils.legacy.convert_vocab.get_vocab_from_old(old_path)

Convert a v1 vocab file to a v2 Vocab.

Parameters: old_path (str) – The v1 vocab file path.
Returns: Vocab – The v2 Vocab.
Return type: medcat.vocab.Vocab