medcat.utils.vocab_utils
Attributes
logger
Classes
CDB | The abstract serialisable base class.
Vocab | Vocabulary used to store word embeddings for context similarity calculation.
Functions
calc_matrix(vocab, target_size) | Calculate the transformation matrix based on the word vectors in the Vocab.
convert_vec(cur, matrix[, target_dtype]) | Helper function to convert the vector.
convert_vocab(vocab, matrix) | Use the transformation matrix to convert the word vectors.
convert_context_vectors(cdb, matrix) | Use the transformation matrix to convert the context vectors within the CDB.
convert_vocab_vector_size(cdb, vocab, vec_size) | Convert the vocab vector size to a smaller one.
Module Contents
- class medcat.utils.vocab_utils.CDB(config)
Bases:
medcat.storage.serialisables.AbstractSerialisable
The abstract serialisable base class.
This defines some common defaults.
- Parameters:
config (medcat.config.Config)
- __init__(config)
- Parameters:
config (medcat.config.Config)
- Return type:
None
- config
- cui2info: dict[str, medcat.cdb.concepts.CUIInfo]
- name2info: dict[str, medcat.cdb.concepts.NameInfo]
- type_id2info: dict[str, medcat.cdb.concepts.TypeInfo]
- token_counts: dict[str, int]
- addl_info: dict[str, Any]
- _subnames: set[str]
- is_dirty = False
- has_changed_names = False
- classmethod get_init_attrs()
- Return type:
list[str]
- _reset_subnames()
- has_subname(name)
Whether the CDB has the specified subname.
- Parameters:
name (str) – The subname to check.
- Returns:
bool – Whether the subname is present in this CDB.
- Return type:
bool
- get_name(cui)
Returns the preferred name if it exists; otherwise returns the longest name assigned to the concept.
- Parameters:
cui (str) – Concept ID or unique identifier in this database.
- Returns:
str – The name of the concept.
- Return type:
str
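The fallback rule described for get_name can be sketched as follows. This is only an illustration under stated assumptions: `pick_display_name` and its arguments are hypothetical and are not part of the MedCAT API, and the alphabetical tie-break is a choice made here for determinism.

```python
from typing import Optional


def pick_display_name(preferred_name: Optional[str], names: set[str]) -> str:
    """Return the preferred name if set, otherwise the longest known name.

    Hypothetical helper, not the library's implementation.
    """
    if preferred_name:
        return preferred_name
    # Sort first so that ties on length resolve alphabetically.
    return max(sorted(names), key=len)


names = {"kf", "renal failure", "kidney failure"}
with_preferred = pick_display_name("Renal failure", names)
without_preferred = pick_display_name(None, names)
```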
- weighted_average_function(step)
Get the weighted average for step.
- Parameters:
step (int) – The step.
- Returns:
float – The weighted average.
- Return type:
float
- add_types(types)
Add type info to CDB.
- Parameters:
types (Iterable[tuple[str, str]]) – The raw type info.
- Return type:
None
- add_names(cui, names, name_status=ST.AUTOMATIC, full_build=False)
Adds a name to an existing concept.
- Parameters:
cui (str) – Concept ID or unique identifier in this database, all concepts that have the same CUI will be merged internally.
names (dict[str, NameDescriptor]) –
Names for this concept, i.e. the values that, if found in free text, can be linked to this concept. Names is a dict like: `{name: {'tokens': tokens, 'snames': snames, 'raw_name': raw_name}, …}`
Names should be generated by the helper function 'medcat.preprocessors.cleaners.prepare_name'.
name_status (str) – One of P, N, A. Defaults to 'A'.
full_build (bool) – If True, the dictionary self.addl_info is also populated. It holds a lot of extra information about concepts but can consume a lot of memory, and it is not necessary for the normal functioning of MedCAT. Defaults to False.
- Return type:
None
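To make the shape of the names argument concrete, here is a minimal sketch that builds one entry by hand. Everything in it is an assumption for illustration: `build_name_entry` is hypothetical, the "~" separator, whitespace tokenisation and lowercasing are placeholders, and treating snames as the progressive prefixes of the token list is a guess at their meaning. In practice the dict should come from the prepare_name helper mentioned above.

```python
def build_name_entry(raw_name: str, separator: str = "~") -> dict:
    """Build a {name: {...}} entry in the shape described for `names`.

    Hypothetical stand-in for the real name preparation step.
    """
    tokens = raw_name.lower().split()
    # Assumed meaning of snames: every prefix of the token sequence.
    snames = {separator.join(tokens[:i]) for i in range(1, len(tokens) + 1)}
    name = separator.join(tokens)
    return {name: {"tokens": tokens, "snames": snames, "raw_name": raw_name}}


entry = build_name_entry("Kidney Failure")
```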
- _add_concept_names(cui, names, name_status)
- Parameters:
cui (str)
names (dict[str, medcat.preprocessors.cleaners.NameDescriptor])
name_status (str)
- Return type:
None
- _add_full_build(cui, names, ontologies, description, type_ids)
- Parameters:
cui (str)
names (dict[str, medcat.preprocessors.cleaners.NameDescriptor])
ontologies (set[str])
description (str)
type_ids (set[str])
- Return type:
None
- _add_concept(cui, names, ontologies, name_status, type_ids, description, full_build=False)
Add a concept to internal Concept Database (CDB). Depending on what you are providing this will add a large number of properties for each concept.
- Parameters:
cui (str) – Concept ID or unique identifier in this database, all concepts that have the same CUI will be merged internally.
names (dict[str, NameDescriptor]) –
Names for this concept, i.e. the values that, if found in free text, can be linked to this concept. Names is a dict like: `{name: {'tokens': tokens, 'snames': snames, 'raw_name': raw_name}, …}`
Names should be generated by the helper function 'medcat.preprocessors.cleaners.prepare_name'.
ontologies (set[str]) – Ontologies in which the concept exists (e.g. SNOMEDCT, HPO).
name_status (str) – One of P, N, A.
type_ids (set[str]) – Semantic type identifiers (have a look at TUIs in UMLS or SNOMED-CT).
description (str) – Description of this concept.
full_build (bool) – If True, the dictionary self.addl_info is also populated. It holds a lot of extra information about concepts but can consume a lot of memory, and it is not necessary for the normal functioning of MedCAT. Defaults to False.
- Return type:
None
- reset_training()
Removes all training effects, in other words all embeddings that were learnt for concepts in the current CDB. Note that this does not remove synonyms (names) that may have been added during supervised/online learning.
- Return type:
None
- filter_by_cui(cuis_to_keep)
Subset the core CDB fields (dictionaries/maps).
Note that this may keep a few more CUIs than are listed in cuis_to_keep. It first finds all names that link to the cuis_to_keep, then finds all CUIs that link to those names, and keeps all of them.
It also does not remove any data from cdb.addl_info, as this field can contain data of unknown structure.
- Parameters:
cuis_to_keep (Collection[str]) – CUIs that will be kept, the rest will be removed (not completely, look above).
- Raises:
Exception – If no snames and subsetting is not possible.
- Return type:
None
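The two-step expansion described for filter_by_cui (CUIs to names, then names back to CUIs) can be sketched with plain dictionaries. The `cui2names` and `name2cuis` maps and the function name are hypothetical simplifications; the real CDB stores this information in its own structures.

```python
def expand_cuis_to_keep(cuis_to_keep, cui2names, name2cuis):
    """Return (cuis, names) that would survive the subsetting.

    Hypothetical sketch over plain dicts, not the library's code.
    """
    # Step 1: every name that links to one of the requested CUIs.
    names_to_keep = set()
    for cui in cuis_to_keep:
        names_to_keep.update(cui2names.get(cui, ()))
    # Step 2: every CUI that any of those names links to.
    expanded = set(cuis_to_keep)
    for name in names_to_keep:
        expanded.update(name2cuis.get(name, ()))
    return expanded, names_to_keep


cui2names = {"C1": {"cold"}, "C2": {"cold", "chill"}, "C3": {"fever"}}
name2cuis = {"cold": {"C1", "C2"}, "chill": {"C2"}, "fever": {"C3"}}
kept_cuis, kept_names = expand_cuis_to_keep({"C1"}, cui2names, name2cuis)
```

In this toy data, asking for C1 alone also keeps C2 because both share the name "cold", which is the "a few more CUIs" effect the note above warns about.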
- remove_cui(cui)
This function takes a CUI and removes it from the CDB.
It also removes the CUI from the name-specific per_cui_status maps and removes all names that no longer correspond to any CUI after this one has been removed.
- Parameters:
cui (str) – The CUI to remove.
- Return type:
None
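A minimal sketch of the clean-up that remove_cui describes, again over hypothetical plain dictionaries (`cui2names`, `name2cuis`). It leaves out the per_cui_status bookkeeping and is not the library's implementation.

```python
def drop_cui(cui, cui2names, name2cuis):
    """Remove a CUI and any names left without a CUI (in place).

    Hypothetical sketch; the dict layout is an assumption.
    """
    for name in cui2names.pop(cui, set()):
        linked = name2cuis.get(name)
        if linked is None:
            continue
        linked.discard(cui)
        if not linked:
            # The name no longer points at any concept, so drop it too.
            del name2cuis[name]


cui2names = {"C1": {"cold"}, "C2": {"cold", "chill"}}
name2cuis = {"cold": {"C1", "C2"}, "chill": {"C2"}}
drop_cui("C2", cui2names, name2cuis)
```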
- _remove_names(cui, names)
Remove names from an existing concept. The effect is that these names will never again be used to link to this concept. This only removes the names from the linker (namely name2cuis and name2cuis2status); the names will still be present everywhere else. The reason is that removing them from everywhere is bothersome, and it can also be useful to keep the removed names in e.g. cui2names.
- Parameters:
cui (str) – Concept ID or unique identifier in this database.
names (Iterable[str]) – Names to be removed (e.g. a list, a set, or even a dict, in which case the keys will be used).
- Return type:
None
- __eq__(other)
- Parameters:
other (Any)
- Return type:
bool
- get_cui2count_train()
- Return type:
dict[str, int]
- get_name2count_train()
- Return type:
dict[str, int]
- get_hash()
- Return type:
str
- get_basic_info()
- Return type:
medcat.data.model_card.CDBInfo
- save(save_path, serialiser=AvailableSerialisers.dill, overwrite=False)
Save CDB at path.
- Parameters:
save_path (str) – The path to save at.
serialiser (Union[ str, AvailableSerialisers], optional) – The serialiser. Defaults to AvailableSerialisers.dill.
overwrite (bool, optional) – Whether to allow overwriting existing files. Defaults to False.
- Return type:
None
- get_strategy()
- Return type:
- classmethod ignore_attrs()
- Return type:
list[str]
- classmethod include_properties()
- Return type:
list[str]
- class medcat.utils.vocab_utils.Vocab
Bases:
medcat.storage.serialisables.AbstractSerialisable
Vocabulary used to store word embeddings for context similarity calculation. Also used by the spell checker, though only for checking whether something is spelled correctly, not for fixing the spelling.
- Properties:
- vocab (dict[str, WordDescriptor]):
- Map from word to attributes, e.g. {'house': {'vector': <np.array>, 'count': <int>, …}, …}
- index2word (dict[int, str]):
Map from an index to a word - used for negative sampling
- vec_index2word (dict):
Same as index2word but only words that have vectors
- __init__()
- Return type:
None
- vocab: dict[str, WordDescriptor]
- index2word: dict[int, str]
- vec_index2word: dict[int, str]
- cum_probs: numpy.ndarray
- inc_or_add(word, cnt=1, vec=None)
Add a word or increase its count.
- Parameters:
word (str) – Word to be added
cnt (int) – By how much should the count be increased, or to what should it be set if a new word. (Default value = 1)
vec (Optional[np.ndarray]) – Word vector (Default value = None)
- Return type:
None
- remove_all_vectors()
Remove all stored vector representations.
- Return type:
None
- remove_words_below_cnt(cnt)
Remove all words with frequency below cnt.
- Parameters:
cnt (int) – Word count limit.
- Return type:
None
- _rebuild_index()
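Removing words changes which indices are valid, so the index maps need rebuilding afterwards. The sketch below shows one way that could look over a plain dict. The function name, the dict-based word descriptors and the choice to reuse the same indices in vec_index2word are assumptions based only on the property descriptions above.

```python
def drop_rare_words(vocab: dict, cnt: int):
    """Keep words with count >= cnt and rebuild both index maps.

    Hypothetical sketch; word descriptors are plain dicts here.
    """
    kept = {word: desc for word, desc in vocab.items() if desc["count"] >= cnt}
    # Fresh, gap-free indices in insertion order.
    index2word = dict(enumerate(kept))
    # Same indices, restricted to words that actually have a vector.
    vec_index2word = {
        idx: word
        for idx, word in index2word.items()
        if kept[word].get("vector") is not None
    }
    return kept, index2word, vec_index2word


vocab = {
    "house": {"count": 5, "vector": [0.1, 0.2]},
    "hut": {"count": 1, "vector": None},
    "home": {"count": 3, "vector": None},
}
kept, index2word, vec_index2word = drop_rare_words(vocab, 3)
```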
- inc_wc(word, cnt=1)
Increase word count by cnt.
- Parameters:
word (str) – For which word to increase the count
cnt (int) – By how much to increase the count (Default value = 1)
- Return type:
None
- add_vec(word, vec)
Add vector to a word.
- Parameters:
word (str) – To which word to add the vector.
vec (np.ndarray) – The vector to add.
- Return type:
None
- reset_counts(cnt=1)
Reset the count for all words to cnt.
- Parameters:
cnt (int) – New count for all words in the vocab. (Default value = 1)
- Return type:
None
- update_counts(tokens)
Given a list of tokens update counts for words in the vocab.
- Parameters:
tokens (list[str]) – Usually a large block of text split into tokens/words.
- Return type:
None
- add_word(word, cnt=1, vec=None, replace=True)
Add a word to the vocabulary
- Parameters:
word (str) – The word to be added, it should be lemmatized and lowercased
cnt (int) – Count of this word in your dataset (Default value = 1)
vec (Optional[np.ndarray]) – The vector representation of the word (Default value = None)
replace (bool) – Will replace old vector representation (Default value = True)
- Return type:
None
- add_words(path, replace=True)
Adds words to the vocab from a file. The file is required to have the following format (vec being optional):
<word> <cnt>[ <vec_space_separated>]
- e.g. one line: the word house with a 3-dimensional vector
house 34444 0.3232 0.123213 1.231231
- Parameters:
path (str) – path to the file with words and vectors
replace (bool) – existing words in the vocabulary will be replaced. Defaults to True.
- Return type:
None
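The line format above is simple to parse by hand. The following sketch shows one way to split a line into its parts; `parse_vocab_line` is a hypothetical helper, and returning the vector as a plain list of floats (rather than a NumPy array) is a simplification made here.

```python
from typing import Optional


def parse_vocab_line(line: str) -> tuple[str, int, Optional[list[float]]]:
    """Parse '<word> <cnt>[ <vec_space_separated>]' into its parts.

    Hypothetical helper, not part of the documented API.
    """
    parts = line.split()
    if len(parts) < 2:
        raise ValueError(f"Expected at least a word and a count: {line!r}")
    word, cnt = parts[0], int(parts[1])
    # The vector is optional; an absent vector is reported as None.
    vec = [float(value) for value in parts[2:]] or None
    return word, cnt, vec


with_vec = parse_vocab_line("house 34444 0.3232 0.123213 1.231231")
without_vec = parse_vocab_line("the 10")
```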
- init_cumsums()
Initialise cumulative sums.
This is in place of the unigram table. But similarly to it, this approach allows generating a list of indices that match the probabilistic distribution expected as per the word counts of each word.
- Return type:
None
- get_negative_samples(n=6, ignore_punct_and_num=False)
Get N negative samples.
- Parameters:
n (int) – How many words to return (Default value = 6)
ignore_punct_and_num (bool) – Whether to ignore punctuation and numbers. Defaults to False.
- Raises:
Exception – If no unigram table is present.
- Returns:
list[int] – Indices for words in this vocabulary.
- Return type:
list[int]
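The cumulative-sum approach described for init_cumsums can be illustrated with the standard library alone. This is a sketch under assumptions: it weights words by their raw counts (this page does not say whether the library rescales counts first), and the function names are hypothetical.

```python
import bisect
import itertools
import random


def build_cum_probs(counts: list[int]) -> list[float]:
    """Turn word counts into a cumulative probability table."""
    total = sum(counts)
    return [running / total for running in itertools.accumulate(counts)]


def sample_indices(cum_probs: list[float], n: int, rng: random.Random) -> list[int]:
    """Draw n word indices, each with probability proportional to its count."""
    last = len(cum_probs) - 1
    # A uniform draw lands in the interval owned by exactly one index.
    return [min(bisect.bisect_right(cum_probs, rng.random()), last) for _ in range(n)]


cum_probs = build_cum_probs([1, 1, 2])
samples = sample_indices(cum_probs, 1000, random.Random(0))
```

Compared with a materialised unigram table, the cumulative table needs only one float per word, and each draw is a binary search.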
- get_vectors(indices)
- Parameters:
indices (list[int])
- Return type:
list[numpy.ndarray]
- __getitem__(word)
- Parameters:
word (str)
- Return type:
int
- vec(word)
- Parameters:
word (str)
- Return type:
Optional[numpy.ndarray]
- count(word)
- Parameters:
word (str)
- Return type:
int
- item(word)
- Parameters:
word (str)
- Return type:
WordDescriptor
- __contains__(word)
- Parameters:
word (str)
- Return type:
bool
- __eq__(other)
- Parameters:
other (Any)
- Return type:
bool
- save(save_path, serialiser=AvailableSerialisers.dill, overwrite=False)
Save Vocab at path.
- Parameters:
save_path (str) – The path to save at.
serialiser (Union[ str, AvailableSerialisers], optional) – The serialiser. Defaults to AvailableSerialisers.dill.
overwrite (bool, optional) – Whether to allow overwriting existing files. Defaults to False.
- Return type:
None
- get_strategy()
- Return type:
- classmethod get_init_attrs()
- Return type:
list[str]
- classmethod ignore_attrs()
- Return type:
list[str]
- classmethod include_properties()
- Return type:
list[str]
- medcat.utils.vocab_utils.logger
- medcat.utils.vocab_utils.calc_matrix(vocab, target_size)
Calculate the transformation matrix based on the word vectors in the Vocab.
Performs Principal Component Analysis (PCA). This first centres all the word vectors in the Vocab by subtracting their mean. It then finds the covariance matrix. After that, the eigenvalues and eigenvectors are calculated, and the target_size eigenvectors corresponding to the largest eigenvalues are selected to create the transformation matrix.
- Returns:
np.ndarray – The transformation matrix.
- Parameters:
vocab (medcat.vocab.Vocab) – The Vocab.
target_size (int) – The target vector size.
- Return type:
numpy.ndarray
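The PCA steps listed for calc_matrix can be sketched in NumPy as below. This is an illustration rather than the library's code: `pca_matrix` is a hypothetical name, and it takes a plain 2-D array of word vectors (one per row) instead of a Vocab.

```python
import numpy as np


def pca_matrix(vectors: np.ndarray, target_size: int) -> np.ndarray:
    """Return a (target_size, N) projection matrix for row vectors of length N."""
    # Centre the data by subtracting the per-dimension mean.
    centered = vectors - vectors.mean(axis=0)
    # Covariance between dimensions (columns are the variables).
    cov = np.cov(centered, rowvar=False)
    # eigh suits symmetric matrices; eigenvectors are returned as columns.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Indices of the target_size largest eigenvalues, largest first.
    top = np.argsort(eigvals)[::-1][:target_size]
    return eigvecs[:, top].T


rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(50, 5))
matrix = pca_matrix(word_vectors, 2)
```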
- medcat.utils.vocab_utils.convert_vec(cur, matrix, target_dtype=np.float32)
Helper function to convert the vector.
This also guarantees uniform typing (np.float32 by default), since in our experience some vectors may be of a different type beforehand (e.g. np.float64).
- Parameters:
cur (np.ndarray) – The current vector.
matrix (np.ndarray) – The transformation matrix.
target_dtype (Type) – The target element data type. Defaults to np.float32.
- Returns:
np.ndarray – The transformed vector.
- Return type:
numpy.ndarray
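Converting a single vector amounts to a matrix-vector product followed by a cast. A minimal sketch, assuming the matrix has shape (target_size, N) and the vector has length N; `project_vector` is a hypothetical name, not the documented function.

```python
import numpy as np


def project_vector(cur: np.ndarray, matrix: np.ndarray, target_dtype=np.float32) -> np.ndarray:
    """Project cur into the smaller space and cast to a uniform dtype."""
    return np.dot(matrix, cur).astype(target_dtype)


# A toy matrix that simply keeps the first two of three dimensions.
matrix = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
projected = project_vector(np.array([1.5, 2.5, 3.5], dtype=np.float64), matrix)
```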
- medcat.utils.vocab_utils.convert_vocab(vocab, matrix)
Use the transformation matrix to convert the word vectors.
- Parameters:
vocab (Vocab) – The Vocab.
matrix (np.ndarray) – The transformation matrix.
- Return type:
None
- medcat.utils.vocab_utils.convert_context_vectors(cdb, matrix)
Use the transformation matrix to convert the context vectors within the CDB.
- Parameters:
cdb (CDB) – The Concept Database.
matrix (np.ndarray) – The transformation matrix.
- Return type:
None
- medcat.utils.vocab_utils.convert_vocab_vector_size(cdb, vocab, vec_size)
Convert the vocab vector size to a smaller one.
This uses Principal Component Analysis (PCA). The idea is that we first center all the word vectors (in Vocab), then compute the covariance matrix, then find the eigenvalues and eigenvectors, and then we select the top vec_size eigenvectors. This produces a transformation matrix of shape (vec_size, N), where N is the current vector length in the vocab.
After that, we perform the transformation. First we transform all the vectors in the Vocab. And then we transform all the context vectors defined within the CDB.
NOTE: This requires the CDB as well since the per concept context vectors stored within it are based on the vectors in the vocab and thus they also need to be transformed.
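The whole procedure can be sketched end to end on plain dictionaries. The layout used here is an assumption for illustration: `word_vecs` maps a word to its vector, and `context_vecs` maps a CUI to a dict of context-type to vector. The sketch returns new dictionaries, whereas the documented functions return None and so presumably update the Vocab and CDB in place.

```python
import numpy as np


def shrink_vectors(word_vecs: dict, context_vecs: dict, vec_size: int):
    """Fit PCA on the word vectors, then project words and context vectors."""
    stacked = np.stack(list(word_vecs.values()))
    centered = stacked - stacked.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
    # Rows of the matrix are the top vec_size eigenvectors: shape (vec_size, N).
    matrix = eigvecs[:, np.argsort(eigvals)[::-1][:vec_size]].T
    new_words = {
        word: (matrix @ vec).astype(np.float32) for word, vec in word_vecs.items()
    }
    # Context vectors live in the same space, so the same matrix applies.
    new_contexts = {
        cui: {ctx: (matrix @ vec).astype(np.float32) for ctx, vec in per_ctx.items()}
        for cui, per_ctx in context_vecs.items()
    }
    return matrix, new_words, new_contexts


rng = np.random.default_rng(0)
word_vecs = {f"w{i}": rng.normal(size=6) for i in range(40)}
context_vecs = {"C1": {"short": rng.normal(size=6), "long": rng.normal(size=6)}}
matrix, new_words, new_contexts = shrink_vectors(word_vecs, context_vecs, 3)
```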