medcat.utils.vocab_utils ======================== .. py:module:: medcat.utils.vocab_utils Attributes ---------- .. autoapisummary:: medcat.utils.vocab_utils.logger Classes ------- .. autoapisummary:: medcat.utils.vocab_utils.CDB medcat.utils.vocab_utils.Vocab Functions --------- .. autoapisummary:: medcat.utils.vocab_utils.calc_matrix medcat.utils.vocab_utils.convert_vec medcat.utils.vocab_utils.convert_vocab medcat.utils.vocab_utils.convert_context_vectors medcat.utils.vocab_utils.convert_vocab_vector_size Module Contents --------------- .. py:class:: CDB(config) Bases: :py:obj:`medcat.storage.serialisables.AbstractSerialisable` The abstract serialisable base class. This defines some common defaults. .. py:method:: __init__(config) .. py:attribute:: config .. py:attribute:: cui2info :type: dict[str, medcat.cdb.concepts.CUIInfo] .. py:attribute:: name2info :type: dict[str, medcat.cdb.concepts.NameInfo] .. py:attribute:: type_id2info :type: dict[str, medcat.cdb.concepts.TypeInfo] .. py:attribute:: token_counts :type: dict[str, int] .. py:attribute:: addl_info :type: dict[str, Any] .. py:attribute:: _subnames :type: set[str] .. py:attribute:: is_dirty :value: False .. py:attribute:: has_changed_names :value: False .. py:method:: get_init_attrs() :classmethod: .. py:method:: _reset_subnames() .. py:method:: has_subname(name) Whether the CDB has the specified subname. :param name: The subname to check. :type name: str :Returns: **bool** -- Whether the subname is present in this CDB. .. py:method:: get_name(cui) Returns preferred name if it exists, otherwise it will return the longest name assigned to the concept. :param cui: Concept ID or unique identifier in this database. :type cui: str :Returns: **str** -- The name of the concept. .. py:method:: weighted_average_function(step) Get the weighted average for steop. :param step: The steop. :type step: int :Returns: **float** -- The weighted average. .. py:method:: add_types(types) Add type info to CDB. :param types: The raw type info. :type types: Iterable[tuple[str, str]] .. py:method:: add_names(cui, names, name_status = ST.AUTOMATIC, full_build = False) Adds a name to an existing concept. :param cui: Concept ID or unique identifier in this database, all concepts that have the same CUI will be merged internally. :type cui: str :param names: Names for this concept, or the value that if found in free text can be linked to this concept. Names is an dict like: `{name: {'tokens': tokens, 'snames': snames, 'raw_name': raw_name}, ...}` Names should be generated by helper function 'medcat.preprocessing.cleaners.prepare_name' :type names: dict[str, NameDescriptor] :param name_status: One of `P`, `N`, `A`. Defaults to 'A'. :type name_status: str :param full_build: If True the dictionary self.addl_info will also be populated, contains a lot of extra information about concepts, but can be very memory consuming. This is not necessary for normal functioning of MedCAT (Default value `False`). :type full_build: bool .. py:method:: _add_concept_names(cui, names, name_status) .. py:method:: _add_full_build(cui, names, ontologies, description, type_ids) .. py:method:: _add_concept(cui, names, ontologies, name_status, type_ids, description, full_build = False) Add a concept to internal Concept Database (CDB). Depending on what you are providing this will add a large number of properties for each concept. :param cui: Concept ID or unique identifier in this database, all concepts that have the same CUI will be merged internally. :type cui: str :param names: Names for this concept, or the value that if found in free text can be linked to this concept. Names is a dict like: `{name: {'tokens': tokens, 'snames': snames, 'raw_name': raw_name}, ...}` Names should be generated by helper function 'medcat.preprocessing.cleaners.prepare_name' :type names: dict[str, NameDescriptor] :param ontologies: ontologies in which the concept exists (e.g. SNOMEDCT, HPO) :type ontologies: set[str] :param name_status: One of `P`, `N`, `A` :type name_status: str :param type_ids: Semantic type identifier (have a look at TUIs in UMLS or SNOMED-CT) :type type_ids: set[str] :param description: Description of this concept. :type description: str :param full_build: If True the dictionary self.addl_info will also be populated, contains a lot of extra information about concepts, but can be very memory consuming. This is not necessary for normal functioning of MedCAT (Default Value `False`). :type full_build: bool .. py:method:: reset_training() Will remove all training efforts - in other words all embeddings that are learnt for concepts in the current CDB. Please note that this does not remove synonyms (names) that were potentially added during supervised/online learning. .. py:method:: filter_by_cui(cuis_to_keep) Subset the core CDB fields (dictionaries/maps). Note that this will potenitally keep a bit more CUIs then in cuis_to_keep. It will first find all names that link to the cuis_to_keep and then find all CUIs that link to those names and keep all of them. This also will not remove any data from cdb.addl_info - as this field can contain data of unknown structure. :param cuis_to_keep: CUIs that will be kept, the rest will be removed (not completely, look above). :type cuis_to_keep: Collection[str] :raises Exception: If no snames and subsetting is not possible. .. py:method:: remove_cui(cui) This function takes a CUI and removes it the CDB. It also removes the CUI from name specific per_cui_status maps as well as well as removes all the names that do not correspond to any CUIs after the removal of this one. :param cui: The CUI to remove. :type cui: str .. py:method:: _remove_names(cui, names) Remove names from an existing concept - effect is this name will never again be used to link to this concept. This will only remove the name from the linker (namely name2cuis and name2cuis2status), the name will still be present everywhere else. Why? Because it is bothersome to remove it from everywhere, but could also be useful to keep the removed names in e.g. cui2names. :param cui: Concept ID or unique identifier in this database. :type cui: str :param names: Names to be removed (e.g list, set, or even a dict (in which case keys will be used)). :type names: Iterable[str] .. py:method:: __eq__(other) .. py:method:: get_cui2count_train() .. py:method:: get_name2count_train() .. py:method:: get_hash() .. py:method:: get_basic_info() .. py:method:: save(save_path, serialiser = AvailableSerialisers.dill, overwrite = False) Save CDB at path. :param save_path: The path to save at. :type save_path: str :param serialiser: The serialiser. Defaults to AvailableSerialisers.dill. :type serialiser: Union[ str, AvailableSerialisers], optional :param overwrite: Whether to allow overwriting existing files. Defaults to False. :type overwrite: bool, optional .. py:method:: load(path) :classmethod: .. py:method:: get_strategy() .. py:method:: ignore_attrs() :classmethod: .. py:method:: include_properties() :classmethod: .. py:class:: Vocab Bases: :py:obj:`medcat.storage.serialisables.AbstractSerialisable` Vocabulary used to store word embeddings for context similarity calculation. Also used by the spell checker - but not for fixing the spelling only for checking is something correct. Properties: vocab (dict[str, WordDescriptor]): Map from word to attributes, e.g. {'house': {'vector': , 'count': , ...}, ...} index2word (dict[int, str]): From word to an index - used for negative sampling vec_index2word (dict): Same as index2word but only words that have vectors .. py:method:: __init__() .. py:attribute:: vocab :type: dict[str, WordDescriptor] .. py:attribute:: index2word :type: dict[int, str] .. py:attribute:: vec_index2word :type: dict[int, str] .. py:attribute:: cum_probs :type: numpy.ndarray .. py:method:: inc_or_add(word, cnt = 1, vec = None) Add a word or increase its count. :param word: Word to be added :type word: str :param cnt: By how much should the count be increased, or to what should it be set if a new word. (Default value = 1) :type cnt: int :param vec: Word vector (Default value = None) :type vec: Optional[np.ndarray] .. py:method:: remove_all_vectors() Remove all stored vector representations. .. py:method:: remove_words_below_cnt(cnt) Remove all words with frequency below cnt. :param cnt: Word count limit. :type cnt: int .. py:method:: _rebuild_index() .. py:method:: inc_wc(word, cnt = 1) Incraese word count by cnt. :param word: For which word to increase the count :type word: str :param cnt: By how muhc to increase the count (Default value = 1) :type cnt: int .. py:method:: add_vec(word, vec) Add vector to a word. :param word: To which word to add the vector. :type word: str :param vec: The vector to add. :type vec: np.ndarray .. py:method:: reset_counts(cnt = 1) Reset the count for all word to cnt. :param cnt: New count for all words in the vocab. (Default value = 1) :type cnt: int .. py:method:: update_counts(tokens) Given a list of tokens update counts for words in the vocab. :param tokens: Usually a large block of text split into tokens/words. :type tokens: list[str] .. py:method:: add_word(word, cnt = 1, vec = None, replace = True) Add a word to the vocabulary :param word: The word to be added, it should be lemmatized and lowercased :type word: str :param cnt: Count of this word in your dataset (Default value = 1) :type cnt: int :param vec: The vector representation of the word (Default value = None) :type vec: Optional[np.ndarray] :param replace: Will replace old vector representation (Default value = True) :type replace: bool .. py:method:: add_words(path, replace = True) Adds words to the vocab from a file, the file is required to have the following format (vec being optional): [ ] e.g. one line: the word house with 3 dimensional vectors house 34444 0.3232 0.123213 1.231231 :param path: path to the file with words and vectors :type path: str :param replace: existing words in the vocabulary will be replaced. Defaults to True. :type replace: bool .. py:method:: init_cumsums() Initialise cumulative sums. This is in place of the unigram table. But similarly to it, this approach allows generating a list of indices that match the probabilistic distribution expected as per the word counts of each word. .. py:method:: get_negative_samples(n = 6, ignore_punct_and_num = False) Get N negative samples. :param n: How many words to return (Default value = 6) :type n: int :param ignore_punct_and_num: Whether to ignore punctuation and numbers. Defaults to False. :type ignore_punct_and_num: bool :raises Exception: If no unigram table is present. :Returns: **list[int]** -- Indices for words in this vocabulary. .. py:method:: get_vectors(indices) .. py:method:: __getitem__(word) .. py:method:: vec(word) .. py:method:: count(word) .. py:method:: item(word) .. py:method:: __contains__(word) .. py:method:: __eq__(other) .. py:method:: save(save_path, serialiser = AvailableSerialisers.dill, overwrite = False) Save Vocab at path. :param save_path: The path to save at. :type save_path: str :param serialiser: The serialiser. Defaults to AvailableSerialisers.dill. :type serialiser: Union[ str, AvailableSerialisers], optional :param overwrite: Whether to allow overwriting existing files. Defaults to False. :type overwrite: bool, optional .. py:method:: load(path) :classmethod: .. py:method:: get_strategy() .. py:method:: get_init_attrs() :classmethod: .. py:method:: ignore_attrs() :classmethod: .. py:method:: include_properties() :classmethod: .. py:data:: logger .. py:function:: calc_matrix(vocab, target_size) Calculate the transformation matrix based on the word vectors in the Vocab. Performs Principal Component Analysis (PCA). This first means all the word vectors in the Vocab. It then finds the covariance matrix. After that, the eigenvalues and and eigenvectors are calculated. And the `target_size` eigenvectors corresponding to the largest eigenvalues are selected to create the transformation matrix. :param vocab: The Vocab. :type vocab: Vocab :param target_size: The target vector size. :type target_size: int :Returns: **np.ndarray** -- The transformation matrix. .. py:function:: convert_vec(cur, matrix, target_dtype = np.float32) Helper function to convert the vector. This also guarantees uniform typing (of np.float32) since in our experience some vectors may be of a different type before (i.e np.float64). :param cur: The current vector. :type cur: np.ndarray :param matrix: The transformation matrix. :type matrix: np.ndarray :param target_dtype: The target element data ype. Defaults to np.float32. :type target_dtype: Type :Returns: **np.ndarray** -- The transformed vector. .. py:function:: convert_vocab(vocab, matrix) Use the transformation matrix to convert the word vectors. :param vocab: The Vocab. :type vocab: Vocab :param matrix: The transformation matrix. :type matrix: np.ndarray .. py:function:: convert_context_vectors(cdb, matrix) Use the transformation matrix to convert the context vectors within the CDB. :param cdb: The Context Database. :type cdb: CDB :param matrix: The transformation matrix. :type matrix: np.ndarray .. py:function:: convert_vocab_vector_size(cdb, vocab, vec_size) Convert the vocab vector size to a smaller one. This uses Principal Component Analysis (PCA). The idea is that we first center all the word vectors (in Vocab), then compute the covariance matrix, then find the eigenvalues and eigenvectors, and then we select the top `vec_size` eigenvectors. This produces a transformation matrix of shape (vec_size, N), where N is the current vector length in the vocab. After that, we perform the transformation. First we transform all the vectors in the Vocab. And then we transform all the context vectors defined within the CDB. NOTE: This requires the CDB as well since the per concept context vectors stored within it are based on the vectors in the vocab and thus they also need to be transformed. :param cdb: The Concept Database. :type cdb: CDB :param vocab: The Vocab. :type vocab: Vocab :param vec_size: The target vector size. :type vec_size: int