medcat.vocab

Attributes

WordDescriptor

Classes

AbstractSerialisable

The abstract serialisable base class.

Vocab

Vocabulary used to store word embeddings for context similarity

Module Contents

class medcat.vocab.AbstractSerialisable

The abstract serialisable base class.

This defines some common defaults.

get_strategy()
Return type:

SerialisingStrategy

classmethod get_init_attrs()
Return type:

list[str]

classmethod ignore_attrs()
Return type:

list[str]

classmethod include_properties()
Return type:

list[str]

__eq__(other)
Parameters:

other (Any)

Return type:

bool

medcat.vocab.WordDescriptor
class medcat.vocab.Vocab

Bases: medcat.storage.serialisables.AbstractSerialisable

Vocabulary used to store word embeddings for context similarity calculation. It is also used by the spell checker, but only to check whether a word is spelled correctly, not to fix the spelling.

Properties:
vocab (dict[str, WordDescriptor]):
Map from word to attributes, e.g. {'house': {'vector': <np.array>, 'count': <int>, …}, …}

index2word (dict[int, str]):

Map from an index to a word - used for negative sampling

vec_index2word (dict[int, str]):

Same as index2word but only words that have vectors
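
As a rough illustration of how these three mappings relate, the sketch below builds them by hand from plain dicts. The data is a hypothetical stand-in and the descriptor is reduced to the 'vector' and 'count' keys named above; Vocab maintains these structures itself, so this is not how they would normally be created.

```python
import numpy as np

# Hypothetical stand-in data; only the 'vector' and 'count' keys from the
# description above are modelled here.
vocab = {
    "house": {"vector": np.array([0.3232, 0.123213, 1.231231]), "count": 34444},
    "the": {"vector": None, "count": 120000},
}

# Every word gets an index.
index2word = {0: "house", 1: "the"}

# Only words that actually have a vector are kept here.
vec_index2word = {
    idx: word
    for idx, word in index2word.items()
    if vocab[word]["vector"] is not None
}
```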

__init__()
Return type:

None

vocab: dict[str, WordDescriptor]
index2word: dict[int, str]
vec_index2word: dict[int, str]
cum_probs: numpy.ndarray
inc_or_add(word, cnt=1, vec=None)

Add a word or increase its count.

Parameters:
  • word (str) – Word to be added

  • cnt (int) – By how much should the count be increased, or to what should it be set if a new word. (Default value = 1)

  • vec (Optional[np.ndarray]) – Word vector (Default value = None)

Return type:

None
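
The add-or-increment behaviour described above can be sketched on a plain dict. The function name inc_or_add_sketch and the dict layout are assumptions made for illustration; this is not the library's own code.

```python
from typing import Optional

import numpy as np


def inc_or_add_sketch(vocab: dict, word: str, cnt: int = 1,
                      vec: Optional[np.ndarray] = None) -> None:
    """Hypothetical stand-in: bump the count of a known word, else add it."""
    if word in vocab:
        # Known word: increase the count by cnt.
        vocab[word]["count"] += cnt
    else:
        # New word: cnt becomes the initial count.
        vocab[word] = {"vector": vec, "count": cnt}


words: dict = {}
inc_or_add_sketch(words, "house", cnt=5)  # new word, count set to 5
inc_or_add_sketch(words, "house")         # existing word, count increased by 1
```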

remove_all_vectors()

Remove all stored vector representations.

Return type:

None

remove_words_below_cnt(cnt)

Remove all words with frequency below cnt.

Parameters:

cnt (int) – Word count limit.

Return type:

None
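
A minimal sketch of this kind of pruning on a plain dict, assuming that "below" means strictly less than cnt and that indices are reassigned afterwards (which the presence of _rebuild_index suggests but the description does not state). The helper name is hypothetical.

```python
def remove_below_sketch(vocab: dict, cnt: int) -> tuple:
    """Hypothetical stand-in: drop rare words, then renumber the survivors."""
    # Keep only words whose count is at least cnt.
    kept = {word: desc for word, desc in vocab.items() if desc["count"] >= cnt}
    # Rebuild a dense index over the remaining words.
    index2word = dict(enumerate(kept))
    return kept, index2word


example = {
    "house": {"count": 10},
    "hovse": {"count": 1},
    "garden": {"count": 3},
}
pruned, pruned_index = remove_below_sketch(example, cnt=3)
```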

_rebuild_index()
inc_wc(word, cnt=1)

Increase word count by cnt.

Parameters:
  • word (str) – For which word to increase the count

  • cnt (int) – By how much to increase the count (Default value = 1)

Return type:

None

add_vec(word, vec)

Add vector to a word.

Parameters:
  • word (str) – To which word to add the vector.

  • vec (np.ndarray) – The vector to add.

Return type:

None

reset_counts(cnt=1)

Reset the count for all words to cnt.

Parameters:

cnt (int) – New count for all words in the vocab. (Default value = 1)

Return type:

None

update_counts(tokens)

Given a list of tokens update counts for words in the vocab.

Parameters:

tokens (list[str]) – Usually a large block of text split into tokens/words.

Return type:

None
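
One way to picture this update is with a Counter over the tokens. The sketch assumes that tokens missing from the vocab are skipped rather than added, which is how "words in the vocab" reads but is not spelled out above; the function name is hypothetical.

```python
from collections import Counter


def update_counts_sketch(vocab: dict, tokens: list) -> None:
    """Hypothetical stand-in: add token frequencies to known words only."""
    for token, freq in Counter(tokens).items():
        if token in vocab:  # assumption: unknown tokens are ignored
            vocab[token]["count"] += freq


known = {"house": {"count": 1}, "garden": {"count": 1}}
update_counts_sketch(known, "the house has a garden and the house is old".split())
```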

add_word(word, cnt=1, vec=None, replace=True)

Add a word to the vocabulary.

Parameters:
  • word (str) – The word to be added; it should be lemmatized and lowercased

  • cnt (int) – Count of this word in your dataset (Default value = 1)

  • vec (Optional[np.ndarray]) – The vector representation of the word (Default value = None)

  • replace (bool) – Whether to replace the old vector representation (Default value = True)

Return type:

None

add_words(path, replace=True)

Adds words to the vocab from a file. The file is required to have the following format (vec being optional):

<word> <cnt>[ <vec_space_separated>]

e.g. one line: the word house with a 3-dimensional vector

house 34444 0.3232 0.123213 1.231231

Parameters:
  • path (str) – path to the file with words and vectors

  • replace (bool) – Whether existing words in the vocabulary will be replaced. Defaults to True.

Return type:

None
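
To make the file format concrete, here is a small parser for a single line. parse_vocab_line is a hypothetical helper, not part of medcat; it only shows how <word> <cnt>[ <vec_space_separated>] splits into its parts.

```python
import numpy as np


def parse_vocab_line(line: str) -> tuple:
    """Hypothetical helper: split one line into (word, count, vector or None)."""
    parts = line.strip().split()
    word = parts[0]
    cnt = int(parts[1])
    # The vector is optional; everything after the count belongs to it.
    vec = np.array([float(x) for x in parts[2:]]) if len(parts) > 2 else None
    return word, cnt, vec


word, cnt, vec = parse_vocab_line("house 34444 0.3232 0.123213 1.231231")
no_vec = parse_vocab_line("garden 120")
```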

init_cumsums()

Initialise cumulative sums.

This is used in place of the unigram table. Like the unigram table, this approach allows generating a list of indices that follows the probability distribution implied by the word counts.

Return type:

None
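
The idea of sampling from cumulative sums can be sketched with NumPy as follows. The sketch normalises raw counts directly, which is an assumption; the actual method may weight or smooth the counts differently. sample_indices is a hypothetical name.

```python
import numpy as np

# Hypothetical counts for three words at indices 0, 1 and 2.
counts = np.array([1.0, 1.0, 2.0])

# Running total of the normalised counts.
cum_probs = np.cumsum(counts / counts.sum())


def sample_indices(cum_probs: np.ndarray, random_vals: np.ndarray) -> np.ndarray:
    """Map uniform values in [0, 1) to word indices via binary search."""
    return np.searchsorted(cum_probs, random_vals)


rng = np.random.default_rng(0)
samples = sample_indices(cum_probs, rng.random(6))
```

Each index is hit with probability equal to the width of its slice of the [0, 1) interval, so the word with count 2 is drawn about twice as often as either of the others.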

get_negative_samples(n=6, ignore_punct_and_num=False)

Get n negative samples.

Parameters:
  • n (int) – How many words to return (Default value = 6)

  • ignore_punct_and_num (bool) – Whether to ignore punctuation and numbers. Defaults to False.

Raises:

Exception – If no unigram table is present.

Returns:

list[int] – Indices for words in this vocabulary.

Return type:

list[int]
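
A rough sketch of drawing negative samples with optional filtering is given below. The filter rule (keep only words containing at least one letter) and the seed parameter are assumptions made for illustration, and negative_samples_sketch is a hypothetical name. Note that in this sketch filtering can return fewer than n indices.

```python
import numpy as np

# Hypothetical vocabulary with equal probabilities for simplicity.
index2word = {0: "house", 1: ",", 2: "42", 3: "garden"}
cum_probs = np.cumsum(np.full(4, 0.25))


def negative_samples_sketch(n: int = 6, ignore_punct_and_num: bool = False,
                            seed: int = 0) -> list:
    """Hypothetical stand-in: sample indices, optionally dropping non-words."""
    rng = np.random.default_rng(seed)
    inds = np.searchsorted(cum_probs, rng.random(n)).tolist()
    if ignore_punct_and_num:
        # Assumed rule: keep a word only if it contains at least one letter.
        inds = [i for i in inds if any(ch.isalpha() for ch in index2word[i])]
    return inds


all_samples = negative_samples_sketch(n=6)
word_samples = negative_samples_sketch(n=6, ignore_punct_and_num=True)
```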

get_vectors(indices)
Parameters:

indices (list[int])

Return type:

list[numpy.ndarray]

__getitem__(word)
Parameters:

word (str)

Return type:

int

vec(word)
Parameters:

word (str)

Return type:

Optional[numpy.ndarray]

count(word)
Parameters:

word (str)

Return type:

int

item(word)
Parameters:

word (str)

Return type:

WordDescriptor

__contains__(word)
Parameters:

word (str)

Return type:

bool
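
The accessors above carry no descriptions, so the following sketch shows only one plausible reading of them. TinyVocab is a hypothetical class, and the assumption that __getitem__ returns the word count is inferred from its int return type rather than stated in this reference.

```python
from typing import Optional

import numpy as np


class TinyVocab:
    """Hypothetical stand-in mirroring the accessor signatures listed above."""

    def __init__(self) -> None:
        self.vocab: dict = {}

    def __contains__(self, word: str) -> bool:
        return word in self.vocab

    def item(self, word: str) -> dict:
        return self.vocab[word]

    def vec(self, word: str) -> Optional[np.ndarray]:
        return self.vocab[word]["vector"]

    def count(self, word: str) -> int:
        return self.vocab[word]["count"]

    def __getitem__(self, word: str) -> int:
        # Assumption: indexing returns the count, matching the int return type.
        return self.count(word)


tiny = TinyVocab()
tiny.vocab["house"] = {"vector": np.zeros(3), "count": 7}
```

With this layout, `"house" in tiny` checks membership and `tiny["house"]` reads the count without touching the vector.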

__eq__(other)
Parameters:

other (Any)

Return type:

bool

get_strategy()
Return type:

SerialisingStrategy

classmethod get_init_attrs()
Return type:

list[str]

classmethod ignore_attrs()
Return type:

list[str]

classmethod include_properties()
Return type:

list[str]