medcat.utils.legacy.convert_vocab

Classes

Vocab

Vocabulary used to store word embeddings for context similarity

Functions

get_vocab_from_old(old_path)

Convert a v1 vocab file to a v2 Vocab.

Module Contents

class medcat.utils.legacy.convert_vocab.Vocab

Bases: medcat.storage.serialisables.AbstractSerialisable

Vocabulary used to store word embeddings for context similarity calculation. Also used by the spell checker, but only to check whether a word is spelled correctly, not to fix the spelling.

Properties:
vocab (dict[str, WordDescriptor]):

Map from word to attributes, e.g. {'house': {'vector': <np.array>, 'count': <int>, …}, …}

index2word (dict[int, str]):

Map from index to word - used for negative sampling

vec_index2word (dict):

Same as index2word but only words that have vectors

__init__()
Return type:

None

vocab: dict[str, WordDescriptor]
index2word: dict[int, str]
vec_index2word: dict[int, str]
cum_probs: numpy.ndarray
inc_or_add(word, cnt=1, vec=None)

Add a word or increase its count.

Parameters:
  • word (str) – Word to be added

  • cnt (int) – By how much should the count be increased, or to what should it be set if a new word. (Default value = 1)

  • vec (Optional[np.ndarray]) – Word vector (Default value = None)

Return type:

None

remove_all_vectors()

Remove all stored vector representations.

Return type:

None

remove_words_below_cnt(cnt)

Remove all words with frequency below cnt.

Parameters:

cnt (int) – Word count limit.

Return type:

None
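
The pruning described above can be illustrated with plain dictionaries. This is only a sketch: it assumes that words with a count strictly below cnt are dropped and that the index map is renumbered afterwards, and it uses a plain dict[str, int] in place of the real WordDescriptor entries. The function below is a hypothetical stand-in, not the method's actual implementation.

```python
def remove_words_below_cnt(
    counts: dict[str, int], cnt: int
) -> tuple[dict[str, int], dict[int, str]]:
    """Return the surviving word counts and a freshly numbered index map."""
    # Keep only words whose count is at least cnt.
    kept = {word: c for word, c in counts.items() if c >= cnt}
    # Assumption: indices are reassigned consecutively after pruning.
    index2word = dict(enumerate(kept))
    return kept, index2word


counts = {"house": 5, "hous": 1, "tree": 2}  # hypothetical counts
kept, index2word = remove_words_below_cnt(counts, 2)
```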

_rebuild_index()
inc_wc(word, cnt=1)

Increase word count by cnt.

Parameters:
  • word (str) – For which word to increase the count

  • cnt (int) – By how much to increase the count (Default value = 1)

Return type:

None

add_vec(word, vec)

Add vector to a word.

Parameters:
  • word (str) – To which word to add the vector.

  • vec (np.ndarray) – The vector to add.

Return type:

None

reset_counts(cnt=1)

Reset the count for all words to cnt.

Parameters:

cnt (int) – New count for all words in the vocab. (Default value = 1)

Return type:

None

update_counts(tokens)

Given a list of tokens update counts for words in the vocab.

Parameters:

tokens (list[str]) – Usually a large block of text split into tokens/words.

Return type:

None
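
A rough sketch of the counting step, using a plain dict[str, int] instead of the real vocab entries. It assumes that tokens missing from the vocabulary are skipped, which the wording "words in the vocab" suggests but does not confirm.

```python
from collections import Counter


def update_counts(vocab_counts: dict[str, int], tokens: list[str]) -> None:
    """Add token frequencies to the counts of words already in vocab_counts."""
    for word, seen in Counter(tokens).items():
        # Assumption: tokens that are not in the vocabulary are ignored.
        if word in vocab_counts:
            vocab_counts[word] += seen


vocab_counts = {"house": 2, "tree": 1}  # hypothetical starting counts
update_counts(vocab_counts, "the house by the other house".split())
```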

add_word(word, cnt=1, vec=None, replace=True)

Add a word to the vocabulary.

Parameters:
  • word (str) – The word to be added; it should be lemmatized and lowercased

  • cnt (int) – Count of this word in your dataset (Default value = 1)

  • vec (Optional[np.ndarray]) – The vector representation of the word (Default value = None)

  • replace (bool) – Whether to replace an existing vector representation (Default value = True)

Return type:

None

add_words(path, replace=True)

Add words to the vocab from a file. The file must have the following format (vec being optional):

<word> <cnt>[ <vec_space_separated>]

For example, one line for the word house with a 3-dimensional vector:

house 34444 0.3232 0.123213 1.231231

Parameters:
  • path (str) – path to the file with words and vectors

  • replace (bool) – Whether existing words in the vocabulary will be replaced. Defaults to True.

Return type:

None
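
To make the file format concrete, here is a minimal parsing sketch for a single line. The helper name parse_vocab_line is hypothetical and this is not how add_words itself is implemented; it only shows how the word, count, and optional vector are laid out.

```python
from typing import Optional


def parse_vocab_line(line: str) -> tuple[str, int, Optional[list[float]]]:
    """Split one '<word> <cnt>[ <vec_space_separated>]' line into its parts."""
    parts = line.split()
    if len(parts) < 2:
        raise ValueError(f"expected at least '<word> <cnt>', got: {line!r}")
    word, cnt = parts[0], int(parts[1])
    # Anything after the count is treated as the vector components;
    # a line without them yields None for the vector.
    vec = [float(x) for x in parts[2:]] or None
    return word, cnt, vec


word, cnt, vec = parse_vocab_line("house 34444 0.3232 0.123213 1.231231")
```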

init_cumsums()

Initialise cumulative sums.

This replaces the unigram table. Like the unigram table, it allows generating a list of indices whose distribution follows the word counts.

Return type:

None
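
The idea of sampling through cumulative sums can be sketched with the standard library alone. This is an illustration under assumptions: it uses raw counts as weights (whether the real method applies any smoothing to the counts is not stated here), and the helper names build_cum_probs and sample_indices are hypothetical.

```python
import bisect
import itertools
import random


def build_cum_probs(counts: list[int]) -> list[float]:
    """Return cumulative probabilities proportional to the raw counts."""
    total = sum(counts)
    return [c / total for c in itertools.accumulate(counts)]


def sample_indices(cum_probs: list[float], n: int, rng: random.Random) -> list[int]:
    """Draw n indices; index i is chosen with probability counts[i] / total."""
    last = len(cum_probs) - 1
    # bisect_right finds the first cumulative value above the random draw;
    # min() guards against floating-point rounding in the final entry.
    return [min(bisect.bisect_right(cum_probs, rng.random()), last) for _ in range(n)]


counts = [6, 3, 1]  # hypothetical counts for three words
cum_probs = build_cum_probs(counts)
samples = sample_indices(cum_probs, 1000, random.Random(0))
```

Compared with a unigram table, which repeats each index in proportion to its count, the cumulative array needs only one entry per word and a binary search per draw.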

get_negative_samples(n=6, ignore_punct_and_num=False)

Get N negative samples.

Parameters:
  • n (int) – How many words to return (Default value = 6)

  • ignore_punct_and_num (bool) – Whether to ignore punctuation and numbers. Defaults to False.

Raises:

Exception – If no unigram table is present.

Returns:

list[int] – Indices for words in this vocabulary.

Return type:

list[int]

get_vectors(indices)
Parameters:

indices (list[int])

Return type:

list[numpy.ndarray]

__getitem__(word)
Parameters:

word (str)

Return type:

int

vec(word)
Parameters:

word (str)

Return type:

Optional[numpy.ndarray]

count(word)
Parameters:

word (str)

Return type:

int

item(word)
Parameters:

word (str)

Return type:

WordDescriptor

__contains__(word)
Parameters:

word (str)

Return type:

bool

__eq__(other)
Parameters:

other (Any)

Return type:

bool

save(save_path, serialiser=AvailableSerialisers.dill, overwrite=False)

Save Vocab at path.

Parameters:
  • save_path (str) – The path to save at.

  • serialiser (Union[ str, AvailableSerialisers], optional) – The serialiser. Defaults to AvailableSerialisers.dill.

  • overwrite (bool, optional) – Whether to allow overwriting existing files. Defaults to False.

Return type:

None

classmethod load(path)
Parameters:

path (str)

Return type:

Vocab

get_strategy()
Return type:

SerialisingStrategy

classmethod get_init_attrs()
Return type:

list[str]

classmethod ignore_attrs()
Return type:

list[str]

classmethod include_properties()
Return type:

list[str]

medcat.utils.legacy.convert_vocab.get_vocab_from_old(old_path)

Convert a v1 vocab file to a v2 Vocab.

Parameters:

old_path (str) – The v1 vocab file path.

Returns:

Vocab – The v2 Vocab.

Return type:

medcat.vocab.Vocab
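
A possible way to use the conversion, based only on the signatures listed above. The wrapper name convert_legacy_vocab and both path arguments are hypothetical, and the exact layout that save() expects at the target path is not specified here, so treat this as a sketch rather than a recipe.

```python
def convert_legacy_vocab(old_path: str, new_path: str):
    """Convert a v1 vocab file and save the resulting v2 Vocab."""
    # Imported lazily so the sketch can be defined without MedCAT installed.
    from medcat.utils.legacy.convert_vocab import get_vocab_from_old

    vocab = get_vocab_from_old(old_path)
    # overwrite defaults to False in the documented save() signature.
    vocab.save(new_path)
    return vocab
```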