medcat.vocab

Attributes

WordDescriptor

Classes

`AbstractSerialisable`	The abstract serialisable base class.
`AvailableSerialisers`	Describes the available serialisers.
`Vocab`	Vocabulary used to store word embeddings for context similarity

Functions

`deserialise`(folder_path[, ignore_folders_prefix, ...])	Deserialise contents of a folder.
`serialise`(serialiser_type, obj, target_folder[, overwrite])	Serialise an object based on the specified serialiser type.

Module Contents

class medcat.vocab.AbstractSerialisable

The abstract serialisable base class.

This defines some common defaults.

get_strategy()

Return type:: SerialisingStrategy

classmethod get_init_attrs()

Return type:: list[str]

classmethod ignore_attrs()

Return type:: list[str]

classmethod include_properties()

Return type:: list[str]

__eq__(other)

Parameters:: other (Any)
Return type:: bool

medcat.vocab.deserialise(folder_path, ignore_folders_prefix=set(), ignore_folders_suffix=set(), **init_kwargs)

Deserialise contents of a folder.

Extra init keyword arguments can be provided if needed. These are generally:

- cnf: The config relevant to the components
- tokenizer (BaseTokenizer): The base tokenizer for the model
- cdb (CDB): The CDB for the model
- vocab (Vocab): The Vocab for the model
- model_load_path (Optional[str]): The model load path, but not the component load path

This method finds the serialiser to be used based on the files on disk.

Parameters:

folder_path (str) – The folder to deserialise.
ignore_folders_prefix (set[str]) – The prefixes of folders to ignore.
ignore_folders_suffix (set[str]) – The suffixes of folders to ignore.

Returns:

Serialisable – The deserialised object.

Return type:

medcat.storage.serialisables.Serialisable
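
As a rough illustration of how ignore_folders_prefix and ignore_folders_suffix might be applied, the sketch below filters folder names with plain string matching. The helper should_ignore and the example folder names are hypothetical and not part of medcat; the actual matching rules used by deserialise are an assumption here.

```python
def should_ignore(folder_name: str,
                  ignore_prefixes: set[str],
                  ignore_suffixes: set[str]) -> bool:
    """Hypothetical helper: True if the name matches any given prefix or suffix."""
    # str.startswith / str.endswith accept a tuple of candidates;
    # an empty tuple never matches.
    return (folder_name.startswith(tuple(ignore_prefixes))
            or folder_name.endswith(tuple(ignore_suffixes)))


# Made-up folder names, for illustration only
folders = ["cdb", "vocab", "tmp_cache", "old_backup"]
kept = [f for f in folders if not should_ignore(f, {"tmp_"}, {"_backup"})]
```

With empty sets (the documented defaults), nothing is ignored.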

class medcat.vocab.AvailableSerialisers

Bases: enum.Enum

Describes the available serialisers.

dill

json

write_to(file_path)

Parameters:: file_path (str)
Return type:: None

classmethod from_file(file_path)

Parameters:: file_path (str)
Return type:: AvailableSerialisers

__new__(value)

_generate_next_value_(start, count, last_values)

Generate the next value when not given.

name: the name of the member
start: the initial start value or None
count: the number of existing members
last_value: the last value assigned or None

classmethod _missing_(value)

__repr__()

__str__()

__dir__(): Returns all members and all public methods

__format__(format_spec): Returns format using actual value type unless __str__ has been overridden.

__hash__()

__reduce_ex__(proto)

name(): The name of the Enum member.

value(): The value of the Enum member.

medcat.vocab.serialise(serialiser_type, obj, target_folder, overwrite=False)

Serialise an object based on the specified serialiser type.

Parameters:

serialiser_type (Union[str, AvailableSerialisers]) – The serialiser type.
obj (Serialisable) – The object to serialise.
target_folder (str) – The folder to serialise into.
overwrite (bool) – Whether to allow overwriting. Defaults to False.

Return type:

None

medcat.vocab.WordDescriptor

class medcat.vocab.Vocab

Bases: medcat.storage.serialisables.AbstractSerialisable

Vocabulary used to store word embeddings for context similarity calculation. Also used by the spell checker, but only to check whether a word is spelled correctly, not to fix the spelling.

Properties:

vocab (dict[str, WordDescriptor]):

Map from word to attributes, e.g. {'house': {'vector': <np.array>, 'count': <int>, …}, …}

index2word (dict[int, str]):

Map from index to word; used for negative sampling

vec_index2word (dict[int, str]):

Same as index2word, but only for words that have vectors

__init__()

Return type:: None

vocab: dict[str, WordDescriptor]

index2word: dict[int, str]

vec_index2word: dict[int, str]

cum_probs: numpy.ndarray

inc_or_add(word, cnt=1, vec=None)

Add a word or increase its count.

Parameters:

word (str) – Word to be added
cnt (int) – How much to increase the count by, or the initial count if the word is new. (Default value = 1)
vec (Optional[np.ndarray]) – Word vector (Default value = None)

Return type:

None
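
The add-or-increment behaviour described above can be sketched with a plain dict standing in for Vocab.vocab. This is a minimal stand-in, not medcat's implementation: the 'vector' and 'count' keys follow the WordDescriptor example in the class description, and leaving the vector of an existing word untouched is an assumption.

```python
from typing import Optional


def inc_or_add(vocab: dict, word: str, cnt: int = 1,
               vec: Optional[list] = None) -> None:
    """Stand-in sketch: bump the count of a known word, or add a new one."""
    if word in vocab:
        # Known word: increase its count by cnt (vector left as-is; an assumption)
        vocab[word]["count"] += cnt
    else:
        # New word: cnt becomes the initial count
        vocab[word] = {"vector": vec, "count": cnt}


vocab: dict = {}
inc_or_add(vocab, "house", cnt=5)   # new word, count set to 5
inc_or_add(vocab, "house")          # existing word, count increased by 1
```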

remove_all_vectors()

Remove all stored vector representations.

Return type:: None

remove_words_below_cnt(cnt)

Remove all words with frequency below cnt.

Parameters:: cnt (int) – Word count limit.
Return type:: None

_rebuild_index()
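
A minimal sketch of pruning by count, using a plain dict in place of Vocab. It assumes that words with a count equal to cnt are kept (only those strictly below are removed) and that the index maps are rebuilt as a contiguous index-to-word mapping afterwards; the latter is a guess at the role of _rebuild_index, which is not documented here.

```python
def remove_words_below_cnt(vocab: dict[str, dict],
                           cnt: int) -> tuple[dict[str, dict], dict[int, str]]:
    """Stand-in sketch: drop rare words, then rebuild index -> word."""
    # Keep only words whose count is at least cnt
    kept = {word: desc for word, desc in vocab.items() if desc["count"] >= cnt}
    # Rebuild a contiguous index -> word map (assumed role of _rebuild_index)
    index2word = dict(enumerate(kept))
    return kept, index2word


vocab = {"house": {"count": 5}, "rare": {"count": 1}, "the": {"count": 90}}
vocab, index2word = remove_words_below_cnt(vocab, 2)
```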

inc_wc(word, cnt=1)

Increase word count by cnt.

Parameters:

word (str) – For which word to increase the count
cnt (int) – By how much to increase the count (Default value = 1)

Return type:

None

add_vec(word, vec)

Add vector to a word.

Parameters:

word (str) – To which word to add the vector.
vec (np.ndarray) – The vector to add.

Return type:

None

reset_counts(cnt=1)

Reset the count for all words to cnt.

Parameters:: cnt (int) – New count for all words in the vocab. (Default value = 1)
Return type:: None

update_counts(tokens)

Given a list of tokens, update counts for words in the vocab.

Parameters:: tokens (list[str]) – Usually a large block of text split into tokens/words.
Return type:: None
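
As a sketch of what updating counts from tokens could look like, the stand-in below increments the count of every token already in a dict-based vocab. Skipping tokens that are not in the vocab is an assumption drawn from the phrase "words in the vocab"; it is not confirmed by this page.

```python
def update_counts(vocab: dict[str, dict], tokens: list[str]) -> None:
    """Stand-in sketch: count occurrences of known words only."""
    for token in tokens:
        if token in vocab:
            vocab[token]["count"] += 1
        # Unknown tokens are skipped (an assumption)


vocab = {"house": {"count": 1}, "the": {"count": 1}}
update_counts(vocab, "the house on the hill".split())
```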

add_word(word, cnt=1, vec=None, replace=True)

Add a word to the vocabulary.

Parameters:

word (str) – The word to be added; it should be lemmatized and lowercased
cnt (int) – Count of this word in your dataset (Default value = 1)
vec (Optional[np.ndarray]) – The vector representation of the word (Default value = None)
replace (bool) – Will replace old vector representation (Default value = True)

Return type:

None

add_words(path, replace=True)

Add words to the vocab from a file. Each line of the file must have the following format (the vector is optional):

<word> <cnt>[ <vec_space_separated>]

For example, a line for the word house with a 3-dimensional vector:

house 34444 0.3232 0.123213 1.231231

Parameters:

path (str) – Path to the file with words and vectors.
replace (bool) – Whether existing words in the vocabulary will be replaced. Defaults to True.

Return type:

None
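
The line format above can be parsed with a few lines of plain Python. The function parse_vocab_line is a hypothetical helper written for illustration, not part of medcat, and it returns the vector as a list of floats rather than an np.ndarray to stay dependency-free.

```python
from typing import Optional


def parse_vocab_line(line: str) -> tuple[str, int, Optional[list[float]]]:
    """Parse one '<word> <cnt>[ <vec_space_separated>]' line."""
    parts = line.split()
    if len(parts) < 2:
        raise ValueError(f"expected '<word> <cnt>[ <vec>]', got: {line!r}")
    word, cnt = parts[0], int(parts[1])
    # Everything after the count is the optional vector
    vec = [float(x) for x in parts[2:]] or None
    return word, cnt, vec


word, cnt, vec = parse_vocab_line("house 34444 0.3232 0.123213 1.231231")
```

A line without a vector, such as "the 10", yields None in place of the vector.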

init_cumsums()

Initialise cumulative sums.

This replaces the unigram table. Like the unigram table, it allows generating a list of indices whose distribution matches the one expected from each word's count.

Return type:: None
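
The idea of sampling indices from cumulative sums can be sketched as follows. This is an illustration of the general technique using only the standard library, not medcat's code: the function names are hypothetical, and using raw counts as weights (with no smoothing or exponent) is an assumption.

```python
import bisect
import itertools
import random


def build_cum_probs(counts: list[int]) -> list[float]:
    """Normalised running totals: entry i is P(index <= i)."""
    total = sum(counts)
    return [running / total for running in itertools.accumulate(counts)]


def sample_indices(cum_probs: list[float], n: int,
                   rng: random.Random) -> list[int]:
    """Draw n indices; index i is chosen with probability counts[i] / total."""
    last = len(cum_probs) - 1
    # A uniform draw in [0, 1) falls into exactly one cumulative interval;
    # bisect finds it in O(log V). min() guards against rounding at the top end.
    return [min(bisect.bisect_right(cum_probs, rng.random()), last)
            for _ in range(n)]


cum_probs = build_cum_probs([1, 3, 6])          # word counts for 3 words
samples = sample_indices(cum_probs, 1000, random.Random(0))
```

Compared with a unigram table, which repeats each index in proportion to its count, this keeps only one float per word and trades a table lookup for a binary search.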

get_negative_samples(n=6, ignore_punct_and_num=False)

Get N negative samples.

Parameters:

n (int) – How many words to return (Default value = 6)
ignore_punct_and_num (bool) – Whether to ignore punctuation and numbers. Defaults to False.

Raises:

Exception – If no unigram table is present.

Returns:

list[int] – Indices for words in this vocabulary.

Return type:

list[int]

get_vectors(indices)

Parameters:: indices (list[int])
Return type:: list[numpy.ndarray]

__getitem__(word)

Parameters:: word (str)
Return type:: int

vec(word)

Parameters:: word (str)
Return type:: Optional[numpy.ndarray]

count(word)

Parameters:: word (str)
Return type:: int

item(word)

Parameters:: word (str)
Return type:: WordDescriptor

__contains__(word)

Parameters:: word (str)
Return type:: bool

__eq__(other)

Parameters:: other (Any)
Return type:: bool

save(save_path, serialiser=AvailableSerialisers.dill, overwrite=False)

Save Vocab at path.

Parameters:

save_path (str) – The path to save at.
serialiser (Union[ str, AvailableSerialisers], optional) – The serialiser. Defaults to AvailableSerialisers.dill.
overwrite (bool, optional) – Whether to allow overwriting existing files. Defaults to False.

Return type:

None

classmethod load(path)

Parameters:: path (str)
Return type:: Vocab

get_strategy()

Return type:: SerialisingStrategy

classmethod get_init_attrs()

Return type:: list[str]

classmethod ignore_attrs()

Return type:: list[str]

classmethod include_properties()

Return type:: list[str]