medcat.utils.vocab_utils

Attributes

logger

Classes

`CDB`	The abstract serialisable base class.
`Vocab`	Vocabulary used to store word embeddings for context similarity

Functions

`calc_matrix`(vocab, target_size)	Calculate the transformation matrix based on the word vectors in the Vocab.
`convert_vec`(cur, matrix[, target_dtype])	Helper function to convert the vector.
`convert_vocab`(vocab, matrix)	Use the transformation matrix to convert the word vectors.
`convert_context_vectors`(cdb, matrix)	Use the transformation matrix to convert the context vectors within the CDB.
`convert_vocab_vector_size`(cdb, vocab, vec_size)	Convert the vocab vector size to a smaller one.

Module Contents

class medcat.utils.vocab_utils.CDB(config)

Bases: medcat.storage.serialisables.AbstractSerialisable

The abstract serialisable base class.

This defines some common defaults.

Parameters:: config (medcat.config.Config)

__init__(config)

Parameters:: config (medcat.config.Config)
Return type:: None

config

cui2info: dict[str, medcat.cdb.concepts.CUIInfo]

name2info: dict[str, medcat.cdb.concepts.NameInfo]

type_id2info: dict[str, medcat.cdb.concepts.TypeInfo]

token_counts: dict[str, int]

addl_info: dict[str, Any]

_subnames: set[str]

is_dirty = False

has_changed_names = False

classmethod get_init_attrs()

Return type:: list[str]

_reset_subnames()

has_subname(name)

Whether the CDB has the specified subname.

Parameters:: name (str) – The subname to check.
Returns:: bool – Whether the subname is present in this CDB.
Return type:: bool

get_name(cui)

Returns preferred name if it exists, otherwise it will return the longest name assigned to the concept.

Parameters:: cui (str) – Concept ID or unique identifier in this database.
Returns:: str – The name of the concept.
Return type:: str
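
The fallback rule described for get_name can be sketched over plain Python values. This is only an illustration: `pick_name` and its arguments are hypothetical and not part of MedCAT, and the tie-breaking between equally long names is an assumption.

```python
from typing import Optional


def pick_name(preferred: Optional[str], names: set[str]) -> str:
    """Return the preferred name if set, otherwise the longest known name."""
    if preferred:
        return preferred
    # Sort first so that ties in length resolve deterministically.
    return max(sorted(names), key=len)


print(pick_name("kidney failure", {"renal failure", "kf"}))
print(pick_name(None, {"renal failure", "kf"}))
```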

weighted_average_function(step)

Get the weighted average for step.

Parameters:: step (int) – The step.
Returns:: float – The weighted average.
Return type:: float

add_types(types)

Add type info to CDB.

Parameters:: types (Iterable[tuple[str, str]]) – The raw type info.
Return type:: None

add_names(cui, names, name_status=ST.AUTOMATIC, full_build=False)

Adds a name to an existing concept.

Parameters:

cui (str) – Concept ID or unique identifier in this database, all concepts that have the same CUI will be merged internally.
names (dict[str, NameDescriptor]) –
Names for this concept, i.e. the values that can be linked to this concept when found in free text. Names is a dict like: `{name: {'tokens': tokens, 'snames': snames, 'raw_name': raw_name}, …}`

Names should be generated by the helper function ‘medcat.preprocessing.cleaners.prepare_name’
name_status (str) – One of P, N, A. Defaults to ‘A’.
full_build (bool) – If True, the dictionary self.addl_info is also populated. It contains a lot of extra information about concepts but can consume a lot of memory, and it is not necessary for the normal functioning of MedCAT. Defaults to False.

Return type:

None

_add_concept_names(cui, names, name_status)

Parameters:

cui (str)
names (dict[str, medcat.preprocessors.cleaners.NameDescriptor])
name_status (str)

Return type:

None

_add_full_build(cui, names, ontologies, description, type_ids)

Parameters:

cui (str)
names (dict[str, medcat.preprocessors.cleaners.NameDescriptor])
ontologies (set[str])
description (str)
type_ids (set[str])

Return type:

None

_add_concept(cui, names, ontologies, name_status, type_ids, description, full_build=False)

Add a concept to internal Concept Database (CDB). Depending on what you are providing this will add a large number of properties for each concept.

Parameters:

cui (str) – Concept ID or unique identifier in this database, all concepts that have the same CUI will be merged internally.
names (dict[str, NameDescriptor]) –
Names for this concept, i.e. the values that can be linked to this concept when found in free text. Names is a dict like: `{name: {'tokens': tokens, 'snames': snames, 'raw_name': raw_name}, …}`

Names should be generated by the helper function ‘medcat.preprocessing.cleaners.prepare_name’
ontologies (set[str]) – ontologies in which the concept exists (e.g. SNOMEDCT, HPO)
name_status (str) – One of P, N, A
type_ids (set[str]) – Semantic type identifier (have a look at TUIs in UMLS or SNOMED-CT)
description (str) – Description of this concept.
full_build (bool) – If True, the dictionary self.addl_info is also populated. It contains a lot of extra information about concepts but can consume a lot of memory, and it is not necessary for the normal functioning of MedCAT. Defaults to False.

Return type:

None

reset_training()

Removes all training efforts, in other words all embeddings that have been learnt for concepts in the current CDB. Note that this does not remove synonyms (names) that were potentially added during supervised/online learning.

Return type:: None

filter_by_cui(cuis_to_keep)

Subset the core CDB fields (dictionaries/maps).

Note that this will potentially keep a few more CUIs than those in cuis_to_keep. It first finds all names that link to the cuis_to_keep, then finds all CUIs that link to those names, and keeps all of them.

This also will not remove any data from cdb.addl_info - as this field can contain data of unknown structure.

Parameters:: cuis_to_keep (Collection[str]) – CUIs that will be kept, the rest will be removed (not completely, look above).
Raises:: Exception – If no snames and subsetting is not possible.
Return type:: None
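
The two-step lookup explains why the result can contain more CUIs than requested. Below is a minimal sketch of that expansion over toy mappings. The helper `expand_cuis_to_keep` and the `cui2names` / `name2cuis` dicts are hypothetical stand-ins used for illustration, not MedCAT's internal structures.

```python
def expand_cuis_to_keep(cuis_to_keep, cui2names, name2cuis):
    """Return the set of CUIs that survive the two-step lookup."""
    # Step 1: collect every name linked to a requested CUI.
    names = set()
    for cui in cuis_to_keep:
        names.update(cui2names.get(cui, ()))
    # Step 2: collect every CUI linked to any of those names.
    kept = set()
    for name in names:
        kept.update(name2cuis.get(name, ()))
    return kept


cui2names = {"C1": {"cold"}, "C2": {"cold", "common~cold"}, "C3": {"fever"}}
name2cuis = {"cold": {"C1", "C2"}, "common~cold": {"C2"}, "fever": {"C3"}}
print(sorted(expand_cuis_to_keep({"C1"}, cui2names, name2cuis)))  # ['C1', 'C2']
```

Here only C1 was requested, but C2 is kept as well because it shares the name "cold" with C1.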

remove_cui(cui)

This function takes a CUI and removes it from the CDB.

It also removes the CUI from the name-specific per_cui_status maps, and removes all names that no longer correspond to any CUI after the removal of this one.

Parameters:: cui (str) – The CUI to remove.
Return type:: None

_remove_names(cui, names)

Remove names from an existing concept. The effect is that these names will never again be used to link to this concept. This only removes the names from the linker (namely name2cuis and name2cuis2status); they remain present everywhere else. Removing them from everywhere is bothersome, and it can also be useful to keep the removed names in e.g. cui2names.

Parameters:

cui (str) – Concept ID or unique identifier in this database.
names (Iterable[str]) – Names to be removed (e.g. a list, a set, or even a dict, in which case the keys will be used).

Return type:

None

__eq__(other)

Parameters:: other (Any)
Return type:: bool

get_cui2count_train()

Return type:: dict[str, int]

get_name2count_train()

Return type:: dict[str, int]

get_hash()

Return type:: str

get_basic_info()

Return type:: medcat.data.model_card.CDBInfo

save(save_path, serialiser=AvailableSerialisers.dill, overwrite=False)

Save CDB at path.

Parameters:

save_path (str) – The path to save at.
serialiser (Union[ str, AvailableSerialisers], optional) – The serialiser. Defaults to AvailableSerialisers.dill.
overwrite (bool, optional) – Whether to allow overwriting existing files. Defaults to False.

Return type:

None

classmethod load(path)

Parameters:: path (str)
Return type:: CDB

get_strategy()

Return type:: SerialisingStrategy

classmethod ignore_attrs()

Return type:: list[str]

classmethod include_properties()

Return type:: list[str]

class medcat.utils.vocab_utils.Vocab

Bases: medcat.storage.serialisables.AbstractSerialisable

Vocabulary used to store word embeddings for context similarity calculation. Also used by the spell checker, but only for checking whether a word is spelled correctly, not for fixing the spelling.

Properties:

vocab (dict[str, WordDescriptor]):

Map from word to attributes, e.g. {‘house’: {‘vector’: <np.array>, ‘count’: <int>, …}, …}

index2word (dict[int, str]):

From index to word - used for negative sampling

vec_index2word (dict):

Same as index2word but only words that have vectors

__init__()

Return type:: None

vocab: dict[str, WordDescriptor]

index2word: dict[int, str]

vec_index2word: dict[int, str]

cum_probs: numpy.ndarray

inc_or_add(word, cnt=1, vec=None)

Add a word or increase its count.

Parameters:

word (str) – Word to be added
cnt (int) – By how much should the count be increased, or to what should it be set if a new word. (Default value = 1)
vec (Optional[np.ndarray]) – Word vector (Default value = None)

Return type:

None
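
The add-or-increment behaviour can be sketched with a plain dict in place of the Vocab. The function below is a hypothetical stand-in, and the choice to ignore `vec` when the word already exists is an assumption rather than documented behaviour.

```python
def inc_or_add(vocab: dict, word: str, cnt: int = 1, vec=None) -> None:
    """Add `word` with count `cnt`, or bump its count if already present."""
    if word in vocab:
        # Assumption: an existing word only has its count increased.
        vocab[word]["count"] += cnt
    else:
        vocab[word] = {"vector": vec, "count": cnt}


toy_vocab: dict = {}
inc_or_add(toy_vocab, "house", cnt=3)  # new word, count set to 3
inc_or_add(toy_vocab, "house")         # existing word, count increased by 1
print(toy_vocab["house"]["count"])     # 4
```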

remove_all_vectors()

Remove all stored vector representations.

Return type:: None

remove_words_below_cnt(cnt)

Remove all words with frequency below cnt.

Parameters:: cnt (int) – Word count limit.
Return type:: None

_rebuild_index()

inc_wc(word, cnt=1)

Increase word count by cnt.

Parameters:

word (str) – For which word to increase the count
cnt (int) – By how much to increase the count (Default value = 1)

Return type:

None

add_vec(word, vec)

Add vector to a word.

Parameters:

word (str) – To which word to add the vector.
vec (np.ndarray) – The vector to add.

Return type:

None

reset_counts(cnt=1)

Reset the count for all words to cnt.

Parameters:: cnt (int) – New count for all words in the vocab. (Default value = 1)
Return type:: None

update_counts(tokens)

Given a list of tokens update counts for words in the vocab.

Parameters:: tokens (list[str]) – Usually a large block of text split into tokens/words.
Return type:: None

add_word(word, cnt=1, vec=None, replace=True)

Add a word to the vocabulary

Parameters:

word (str) – The word to be added, it should be lemmatized and lowercased
cnt (int) – Count of this word in your dataset (Default value = 1)
vec (Optional[np.ndarray]) – The vector representation of the word (Default value = None)
replace (bool) – Will replace old vector representation (Default value = True)

Return type:

None

add_words(path, replace=True)

Adds words to the vocab from a file. The file is required to have the following format (vec being optional):

<word> <cnt>[ <vec_space_separated>]

For example, a line for the word house with a 3-dimensional vector: house 34444 0.3232 0.123213 1.231231

Parameters:

path (str) – path to the file with words and vectors
replace (bool) – existing words in the vocabulary will be replaced. Defaults to True.

Return type:

None
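
To make the expected line format concrete, here is a small sketch of a parser for a single line. `parse_vocab_line` is a hypothetical helper written for illustration; it is not the function MedCAT uses internally.

```python
from typing import Optional


def parse_vocab_line(line: str) -> tuple[str, int, Optional[list[float]]]:
    """Split one '<word> <cnt>[ <vec>]' line into its parts."""
    parts = line.strip().split()
    word, cnt = parts[0], int(parts[1])
    # Anything after the count is the space-separated vector, if present.
    vec = [float(x) for x in parts[2:]] if len(parts) > 2 else None
    return word, cnt, vec


print(parse_vocab_line("house 34444 0.3232 0.123213 1.231231"))
print(parse_vocab_line("house 34444"))
```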

init_cumsums()

Initialise cumulative sums.

This is in place of the unigram table. But similarly to it, this approach allows generating a list of indices that match the probabilistic distribution expected as per the word counts of each word.

Return type:: None
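
The idea of sampling through cumulative sums can be sketched with the standard library alone. This is a simplified illustration: the helper names are hypothetical, and the sketch weights words by their raw counts, whereas the real implementation may weight them differently.

```python
import bisect
import itertools
import random


def build_cum_probs(counts: list[int]) -> list[float]:
    """Cumulative probabilities proportional to the raw counts."""
    total = sum(counts)
    return [c / total for c in itertools.accumulate(counts)]


def sample_indices(cum_probs: list[float], n: int, rng: random.Random) -> list[int]:
    """Draw n indices; index i is drawn with probability counts[i] / total."""
    # rng.random() lies in [0, 1), and bisect finds the first cumulative
    # probability strictly greater than it.
    return [bisect.bisect_right(cum_probs, rng.random()) for _ in range(n)]


cum_probs = build_cum_probs([1, 2, 7])  # three words with counts 1, 2 and 7
samples = sample_indices(cum_probs, 1000, random.Random(0))
print([samples.count(i) for i in range(3)])
```

Frequent words occupy a wider slice of the [0, 1) interval, so they are drawn more often, without materialising a large unigram table.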

get_negative_samples(n=6, ignore_punct_and_num=False)

Get N negative samples.

Parameters:

n (int) – How many words to return (Default value = 6)
ignore_punct_and_num (bool) – Whether to ignore punctuation and numbers. Defaults to False.

Raises:

Exception – If no unigram table is present.

Returns:

list[int] – Indices for words in this vocabulary.

Return type:

list[int]

get_vectors(indices)

Parameters:: indices (list[int])
Return type:: list[numpy.ndarray]

__getitem__(word)

Parameters:: word (str)
Return type:: int

vec(word)

Parameters:: word (str)
Return type:: Optional[numpy.ndarray]

count(word)

Parameters:: word (str)
Return type:: int

item(word)

Parameters:: word (str)
Return type:: WordDescriptor

__contains__(word)

Parameters:: word (str)
Return type:: bool

__eq__(other)

Parameters:: other (Any)
Return type:: bool

save(save_path, serialiser=AvailableSerialisers.dill, overwrite=False)

Save Vocab at path.

Parameters:

save_path (str) – The path to save at.
serialiser (Union[ str, AvailableSerialisers], optional) – The serialiser. Defaults to AvailableSerialisers.dill.
overwrite (bool, optional) – Whether to allow overwriting existing files. Defaults to False.

Return type:

None

classmethod load(path)

Parameters:: path (str)
Return type:: Vocab

get_strategy()

Return type:: SerialisingStrategy

classmethod get_init_attrs()

Return type:: list[str]

classmethod ignore_attrs()

Return type:: list[str]

classmethod include_properties()

Return type:: list[str]

medcat.utils.vocab_utils.logger

medcat.utils.vocab_utils.calc_matrix(vocab, target_size)

Calculate the transformation matrix based on the word vectors in the Vocab.

Performs Principal Component Analysis (PCA). This first mean-centers all the word vectors in the Vocab. It then finds the covariance matrix. After that, the eigenvalues and eigenvectors are calculated, and the target_size eigenvectors corresponding to the largest eigenvalues are selected to create the transformation matrix.

Returns:

np.ndarray – The transformation matrix.

Parameters:

vocab (medcat.vocab.Vocab) – The Vocab.
target_size (int) – The target vector size.

Return type:

numpy.ndarray
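
The PCA steps listed above can be sketched with NumPy on a plain array of word vectors. This is a minimal illustration, not MedCAT's implementation: `pca_matrix` is a hypothetical name, and the use of `np.cov` and `np.linalg.eigh` is an assumption about one reasonable way to carry out the steps.

```python
import numpy as np


def pca_matrix(vectors: np.ndarray, target_size: int) -> np.ndarray:
    """Return a (target_size, N) projection matrix for (num_words, N) vectors."""
    # Center the vectors so that every dimension has zero mean.
    centered = vectors - vectors.mean(axis=0)
    # Covariance between the N dimensions, shape (N, N).
    cov = np.cov(centered, rowvar=False)
    # eigh suits symmetric matrices; eigenvectors are the columns of eigvecs.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Indices of the target_size largest eigenvalues, largest first.
    top = np.argsort(eigvals)[::-1][:target_size]
    return eigvecs[:, top].T


rng = np.random.default_rng(0)
vecs = rng.normal(size=(50, 8))  # 50 toy word vectors of length 8
matrix = pca_matrix(vecs, 3)
print(matrix.shape)  # (3, 8)
```

Each row of the result is one principal direction, so multiplying the matrix with a length-8 vector yields a length-3 vector.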

medcat.utils.vocab_utils.convert_vec(cur, matrix, target_dtype=np.float32)

Helper function to convert the vector.

This also guarantees uniform typing (np.float32 by default), since in our experience some vectors may previously have been of a different type (e.g. np.float64).

Parameters:

cur (np.ndarray) – The current vector.
matrix (np.ndarray) – The transformation matrix.
target_dtype (Type) – The target element data type. Defaults to np.float32.

Returns:

np.ndarray – The transformed vector.

Return type:

numpy.ndarray
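
A conversion of this kind amounts to a matrix-vector product followed by a cast. The sketch below shows that idea with a hypothetical `project` function; it assumes the matrix has shape (target_size, N) and is applied from the left, which matches the shape described for the transformation matrix but is not confirmed as the exact implementation.

```python
import numpy as np


def project(cur: np.ndarray, matrix: np.ndarray, target_dtype=np.float32) -> np.ndarray:
    """Project `cur` with `matrix` and cast the result to `target_dtype`."""
    return np.dot(matrix, cur).astype(target_dtype)


# A toy (2, 3) matrix that simply keeps the first two dimensions.
matrix = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
vec = np.array([0.5, -2.0, 3.0], dtype=np.float64)
out = project(vec, matrix)
print(out.shape, out.dtype)
```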

medcat.utils.vocab_utils.convert_vocab(vocab, matrix)

Use the transformation matrix to convert the word vectors.

Parameters:

vocab (Vocab) – The Vocab.
matrix (np.ndarray) – The transformation matrix.

Return type:

None

medcat.utils.vocab_utils.convert_context_vectors(cdb, matrix)

Use the transformation matrix to convert the context vectors within the CDB.

Parameters:

cdb (CDB) – The Concept Database.
matrix (np.ndarray) – The transformation matrix.

Return type:

None

medcat.utils.vocab_utils.convert_vocab_vector_size(cdb, vocab, vec_size)

Convert the vocab vector size to a smaller one.

This uses Principal Component Analysis (PCA). The idea is that we first center all the word vectors (in Vocab), then compute the covariance matrix, then find the eigenvalues and eigenvectors, and then we select the top vec_size eigenvectors. This produces a transformation matrix of shape (vec_size, N), where N is the current vector length in the vocab.

After that, we perform the transformation. First we transform all the vectors in the Vocab. And then we transform all the context vectors defined within the CDB.

NOTE: This requires the CDB as well since the per concept context vectors stored within it are based on the vectors in the vocab and thus they also need to be transformed.

Parameters:

cdb (CDB) – The Concept Database.
vocab (Vocab) – The Vocab.
vec_size (int) – The target vector size.
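
The note about the CDB can be illustrated with toy data. The sketch below assumes, for illustration only, that a context vector is a plain average of word vectors and that the transformation is a purely linear projection. Under those assumptions, transforming a stored context vector gives the same result as rebuilding it from the transformed word vectors, which is why both can be converted with the same matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
word_vecs = rng.normal(size=(4, 6))  # four toy word vectors of length 6
matrix = rng.normal(size=(2, 6))     # any (vec_size, N) linear map

# A toy "context vector": the mean of the word vectors around a concept.
context_vec = word_vecs.mean(axis=0)

# Transforming the stored context vector directly ...
direct = matrix @ context_vec
# ... matches rebuilding it from the transformed word vectors.
rebuilt = (word_vecs @ matrix.T).mean(axis=0)

print(np.allclose(direct, rebuilt))  # True
```

If only the vocab were converted, the context vectors in the CDB would keep the old length (6 in this toy case) and could no longer be compared with the new word vectors (length 2).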