medcat.utils.vocab_utils
========================

.. py:module:: medcat.utils.vocab_utils


Attributes
----------

.. autoapisummary::

   medcat.utils.vocab_utils.logger


Classes
-------

.. autoapisummary::

   medcat.utils.vocab_utils.CDB
   medcat.utils.vocab_utils.Vocab


Functions
---------

.. autoapisummary::

   medcat.utils.vocab_utils.calc_matrix
   medcat.utils.vocab_utils.convert_vec
   medcat.utils.vocab_utils.convert_vocab
   medcat.utils.vocab_utils.convert_context_vectors
   medcat.utils.vocab_utils.convert_vocab_vector_size


Module Contents
---------------

.. py:class:: CDB(config)

   Bases: :py:obj:`medcat.storage.serialisables.AbstractSerialisable`


   The concept database (CDB).

   This inherits common defaults from the abstract serialisable base class.


   .. py:method:: __init__(config)


   .. py:attribute:: config


   .. py:attribute:: cui2info
      :type:  dict[str, medcat.cdb.concepts.CUIInfo]


   .. py:attribute:: name2info
      :type:  dict[str, medcat.cdb.concepts.NameInfo]


   .. py:attribute:: type_id2info
      :type:  dict[str, medcat.cdb.concepts.TypeInfo]


   .. py:attribute:: token_counts
      :type:  dict[str, int]


   .. py:attribute:: addl_info
      :type:  dict[str, Any]


   .. py:attribute:: _subnames
      :type:  set[str]


   .. py:attribute:: is_dirty
      :value: False


   .. py:attribute:: has_changed_names
      :value: False


   .. py:method:: get_init_attrs()
      :classmethod:


   .. py:method:: _reset_subnames()


   .. py:method:: has_subname(name)

      Whether the CDB has the specified subname.

      :param name: The subname to check.
      :type name: str

      :Returns: **bool** -- Whether the subname is present in this CDB.


   .. py:method:: get_name(cui)

      Returns preferred name if it exists, otherwise it will return
      the longest name assigned to the concept.

      :param cui: Concept ID or unique identifier in this database.
      :type cui: str

      :Returns: **str** -- The name of the concept.


   .. py:method:: weighted_average_function(step)

      Get the weighted average for the step.

      :param step: The step.
      :type step: int

      :Returns: **float** -- The weighted average.


   .. py:method:: add_types(types)

      Add type info to CDB.

      :param types: The raw type info.
      :type types: Iterable[tuple[str, str]]


   .. py:method:: add_names(cui, names, name_status = ST.AUTOMATIC, full_build = False)

      Adds a name to an existing concept.

      :param cui: Concept ID or unique identifier in this database, all concepts
                  that have the same CUI will be merged internally.
      :type cui: str
      :param names: Names for this concept, or the value that if found in free
                    text can be linked to this concept. Names is a dict like:
                    `{name: {'tokens': tokens, 'snames': snames,
                             'raw_name': raw_name}, ...}`
                    Names should be generated by helper function
                    'medcat.preprocessing.cleaners.prepare_name'
      :type names: dict[str, NameDescriptor]
      :param name_status: One of `P`, `N`, `A`. Defaults to 'A'.
      :type name_status: str
      :param full_build: If True the dictionary self.addl_info will also be populated,
                         contains a lot of extra information about concepts, but can be
                         very memory consuming. This is not necessary for normal
                         functioning of MedCAT (Default value `False`).
      :type full_build: bool


   .. py:method:: _add_concept_names(cui, names, name_status)


   .. py:method:: _add_full_build(cui, names, ontologies, description, type_ids)


   .. py:method:: _add_concept(cui, names, ontologies, name_status, type_ids, description, full_build = False)

      Add a concept to internal Concept Database (CDB). Depending on what
      you are providing this will add a large number of properties for each
      concept.

      :param cui: Concept ID or unique identifier in this database, all concepts
                  that have the same CUI will be merged internally.
      :type cui: str
      :param names: Names for this concept, or the value that if found in free
                    text can be linked to this concept. Names is a dict like:
                    `{name: {'tokens': tokens, 'snames': snames,
                             'raw_name': raw_name}, ...}`
                    Names should be generated by helper function
                    'medcat.preprocessing.cleaners.prepare_name'
      :type names: dict[str, NameDescriptor]
      :param ontologies: ontologies in which the concept exists (e.g. SNOMEDCT, HPO)
      :type ontologies: set[str]
      :param name_status: One of `P`, `N`, `A`
      :type name_status: str
      :param type_ids: Semantic type identifier (have a look at TUIs in UMLS or
                       SNOMED-CT)
      :type type_ids: set[str]
      :param description: Description of this concept.
      :type description: str
      :param full_build: If True the dictionary self.addl_info will also be populated,
                         contains a lot of extra information about concepts, but can be
                         very memory consuming. This is not necessary for normal
                         functioning of MedCAT (Default Value `False`).
      :type full_build: bool


   .. py:method:: reset_training()

      Will remove all training efforts - in other words all embeddings
      that are learnt for concepts in the current CDB. Please note that this
      does not remove synonyms (names) that were potentially added during
      supervised/online learning.


   .. py:method:: filter_by_cui(cuis_to_keep)

      Subset the core CDB fields (dictionaries/maps).

      Note that this will potentially keep a few more CUIs
      than are in cuis_to_keep. It will first find all names that
      link to the cuis_to_keep and then find all CUIs that
      link to those names and keep all of them.

      This also will not remove any data from cdb.addl_info -
      as this field can contain data of unknown structure.

      :param cuis_to_keep: CUIs that will be kept, the rest will be removed
                           (not completely, look above).
      :type cuis_to_keep: Collection[str]

      :raises Exception: If no snames and subsetting is not possible.


   .. py:method:: remove_cui(cui)

      This function takes a CUI and removes it from the CDB.

      It also removes the CUI from the name-specific per_cui_status
      maps, and removes all the names that do not
      correspond to any CUIs after the removal of this one.

      :param cui: The CUI to remove.
      :type cui: str


   .. py:method:: _remove_names(cui, names)

      Remove names from an existing concept - effect is this name will
      never again be used to link to this concept. This will only remove the
      name from the linker (namely name2cuis and name2cuis2status), the name
      will still be present everywhere else. Why? Because it is bothersome
      to remove it from everywhere, but could also be useful to keep the
      removed names in e.g. cui2names.

      :param cui: Concept ID or unique identifier in this database.
      :type cui: str
      :param names: Names to be removed (e.g. a list, a set, or even a dict, in which
                    case the keys will be used).
      :type names: Iterable[str]


   .. py:method:: __eq__(other)


   .. py:method:: get_cui2count_train()


   .. py:method:: get_name2count_train()


   .. py:method:: get_hash()


   .. py:method:: get_basic_info()


   .. py:method:: save(save_path, serialiser = AvailableSerialisers.dill, overwrite = False)

      Save CDB at path.

      :param save_path: The path to save at.
      :type save_path: str
      :param serialiser: The serialiser. Defaults to AvailableSerialisers.dill.
      :type serialiser: Union[str, AvailableSerialisers], optional
      :param overwrite: Whether to allow overwriting existing files. Defaults to False.
      :type overwrite: bool, optional


   .. py:method:: load(path)
      :classmethod:


   .. py:method:: get_strategy()


   .. py:method:: ignore_attrs()
      :classmethod:


   .. py:method:: include_properties()
      :classmethod:
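
The subsetting rule described for ``CDB.filter_by_cui`` can be illustrated on plain dictionaries. The following is a minimal sketch, not MedCAT's implementation: ``name2cuis`` and ``expand_cuis`` are hypothetical names introduced only for this example, and the real CDB keeps the equivalent information in ``cui2info`` and ``name2info``. It shows why the result can contain more CUIs than were requested.

```python
def expand_cuis(name2cuis, cuis_to_keep):
    """Return the names and CUIs that survive subsetting (illustrative only)."""
    keep = set(cuis_to_keep)
    # Step 1: every name that links to at least one requested CUI is kept.
    names = {name for name, cuis in name2cuis.items() if cuis & keep}
    # Step 2: every CUI linked to a kept name is kept as well.
    kept_cuis = set()
    for name in names:
        kept_cuis |= name2cuis[name]
    return names, kept_cuis


# Hypothetical toy data: one name is shared between C1 and C2.
name2cuis = {
    "kidney~failure": {"C1", "C2"},
    "renal~failure": {"C1"},
    "fever": {"C3"},
}
names, cuis = expand_cuis(name2cuis, {"C1"})
print(sorted(names))
print(sorted(cuis))
```

Only ``C1`` was requested, but ``C2`` is retained because it shares a name with ``C1``.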


.. py:class:: Vocab

   Bases: :py:obj:`medcat.storage.serialisables.AbstractSerialisable`


   Vocabulary used to store word embeddings for context similarity
   calculation. Also used by the spell checker, but only to check whether
   a word is spelled correctly, not to fix the spelling.

   Properties:
       vocab (dict[str, WordDescriptor]):
           Map from word to attributes, e.g. {'house':
               {'vector': <np.array>, 'count': <int>, ...}, ...}
       index2word (dict[int, str]):
           From an index to a word - used for negative sampling
       vec_index2word (dict):
           Same as index2word but only words that have vectors


   .. py:method:: __init__()


   .. py:attribute:: vocab
      :type:  dict[str, WordDescriptor]


   .. py:attribute:: index2word
      :type:  dict[int, str]


   .. py:attribute:: vec_index2word
      :type:  dict[int, str]


   .. py:attribute:: cum_probs
      :type:  numpy.ndarray


   .. py:method:: inc_or_add(word, cnt = 1, vec = None)

      Add a word or increase its count.

      :param word: Word to be added
      :type word: str
      :param cnt: By how much should the count be increased, or to what
                  should it be set if a new word. (Default value = 1)
      :type cnt: int
      :param vec: Word vector (Default value = None)
      :type vec: Optional[np.ndarray]


   .. py:method:: remove_all_vectors()

      Remove all stored vector representations.


   .. py:method:: remove_words_below_cnt(cnt)

      Remove all words with frequency below cnt.

      :param cnt: Word count limit.
      :type cnt: int


   .. py:method:: _rebuild_index()


   .. py:method:: inc_wc(word, cnt = 1)

      Increase word count by cnt.

      :param word: For which word to increase the count
      :type word: str
      :param cnt: By how much to increase the count (Default value = 1)
      :type cnt: int


   .. py:method:: add_vec(word, vec)

      Add vector to a word.

      :param word: To which word to add the vector.
      :type word: str
      :param vec: The vector to add.
      :type vec: np.ndarray


   .. py:method:: reset_counts(cnt = 1)

      Reset the count for all words to cnt.

      :param cnt: New count for all words in the vocab. (Default value = 1)
      :type cnt: int


   .. py:method:: update_counts(tokens)

      Given a list of tokens update counts for words in the vocab.

      :param tokens: Usually a large block of text split into tokens/words.
      :type tokens: list[str]


   .. py:method:: add_word(word, cnt = 1, vec = None, replace = True)

      Add a word to the vocabulary.

      :param word: The word to be added, it should be lemmatized and lowercased
      :type word: str
      :param cnt: Count of this word in your dataset (Default value = 1)
      :type cnt: int
      :param vec: The vector representation of the word (Default value = None)
      :type vec: Optional[np.ndarray]
      :param replace: Will replace old vector representation (Default value = True)
      :type replace: bool


   .. py:method:: add_words(path, replace = True)

      Adds words to the vocab from a file. The file
      is required to have the following format (vec being optional)::

          <word>      <cnt>[  <vec_space_separated>]

      For example, one line for the word house with a 3-dimensional vector::

          house   34444   0.3232 0.123213 1.231231

      :param path: path to the file with words and vectors
      :type path: str
      :param replace: Whether existing words in the vocabulary will be replaced.
                      Defaults to True.
      :type replace: bool


   .. py:method:: init_cumsums()

      Initialise cumulative sums.

      This is in place of the unigram table. But similarly to it, this
      approach allows generating a list of indices that match the
      probabilistic distribution expected as per the word counts of each
      word.


   .. py:method:: get_negative_samples(n = 6, ignore_punct_and_num = False)

      Get N negative samples.

      :param n: How many words to return (Default value = 6)
      :type n: int
      :param ignore_punct_and_num: Whether to ignore punctuation and numbers. Defaults to False.
      :type ignore_punct_and_num: bool

      :raises Exception: If no unigram table is present.

      :Returns: **list[int]** -- Indices for words in this vocabulary.


   .. py:method:: get_vectors(indices)


   .. py:method:: __getitem__(word)


   .. py:method:: vec(word)


   .. py:method:: count(word)


   .. py:method:: item(word)


   .. py:method:: __contains__(word)


   .. py:method:: __eq__(other)


   .. py:method:: save(save_path, serialiser = AvailableSerialisers.dill, overwrite = False)

      Save Vocab at path.

      :param save_path: The path to save at.
      :type save_path: str
      :param serialiser: The serialiser. Defaults to AvailableSerialisers.dill.
      :type serialiser: Union[str, AvailableSerialisers], optional
      :param overwrite: Whether to allow overwriting existing files. Defaults to False.
      :type overwrite: bool, optional


   .. py:method:: load(path)
      :classmethod:


   .. py:method:: get_strategy()


   .. py:method:: get_init_attrs()
      :classmethod:


   .. py:method:: ignore_attrs()
      :classmethod:


   .. py:method:: include_properties()
      :classmethod:
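
The cumulative-sum approach described for ``Vocab.init_cumsums`` and ``Vocab.get_negative_samples`` can be sketched with the standard library alone. This is an illustration under stated assumptions, not the MedCAT code: the helpers ``build_cum_probs`` and ``sample_indices`` are hypothetical, and the sampling probability is assumed to be directly proportional to the raw word count, which may differ from the weighting MedCAT applies.

```python
import bisect
import itertools
import random


def build_cum_probs(counts):
    """Turn word counts into cumulative probabilities (assumed weighting)."""
    total = sum(counts)
    return list(itertools.accumulate(c / total for c in counts))


def sample_indices(cum_probs, n, rng):
    """Draw n word indices; frequent words are drawn more often."""
    last = len(cum_probs) - 1
    # A uniform draw lands in the interval owned by exactly one index.
    # min() guards against the final cumulative value rounding below 1.0.
    return [
        min(bisect.bisect_right(cum_probs, rng.random()), last)
        for _ in range(n)
    ]


counts = [70, 20, 10]  # hypothetical counts for three words
cum_probs = build_cum_probs(counts)
rng = random.Random(0)
samples = sample_indices(cum_probs, 10_000, rng)
freqs = [samples.count(i) / len(samples) for i in range(len(counts))]
print(freqs)
```

Compared with a unigram table, this stores one float per word instead of a table sized by the desired sampling resolution, at the cost of a binary search per draw.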


.. py:data:: logger

.. py:function:: calc_matrix(vocab, target_size)

   Calculate the transformation matrix based on the word vectors in the
   Vocab.

   Performs Principal Component Analysis (PCA).
   This first mean-centers all the word vectors in the Vocab.
   It then computes the covariance matrix.
   After that, the eigenvalues and eigenvectors are calculated.
   The `target_size` eigenvectors corresponding to the largest
   eigenvalues are then selected to create the transformation matrix.

   :param vocab: The Vocab.
   :type vocab: Vocab
   :param target_size: The target vector size.
   :type target_size: int

   :Returns: **np.ndarray** -- The transformation matrix.
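
The PCA steps listed for ``calc_matrix`` can be sketched with NumPy as follows. This is a minimal sketch rather than the module's implementation: ``pca_matrix`` is a hypothetical helper that takes a plain 2-D array of word vectors instead of a ``Vocab``, and the random input data is made up for the example.

```python
import numpy as np


def pca_matrix(vectors, target_size):
    """Build a (target_size, N) projection matrix from row vectors."""
    # 1. Centre the data by subtracting the per-dimension mean.
    centered = vectors - vectors.mean(axis=0)
    # 2. Covariance matrix of the N dimensions (rows are observations).
    cov = np.cov(centered, rowvar=False)
    # 3. Eigen-decomposition; eigh suits symmetric matrices and
    #    returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4. Keep the eigenvectors belonging to the largest eigenvalues.
    top = np.argsort(eigvals)[::-1][:target_size]
    return eigvecs[:, top].T


rng = np.random.default_rng(0)
vectors = rng.normal(size=(50, 6))  # 50 hypothetical word vectors, N = 6
matrix = pca_matrix(vectors, 3)
print(matrix.shape)
```

The rows of the resulting matrix are orthonormal, so multiplying a length-N vector by it yields the vector's coordinates along the selected principal directions.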


.. py:function:: convert_vec(cur, matrix, target_dtype = np.float32)

   Helper function to convert the vector.

   This also guarantees uniform typing (np.float32 by default) since in our
   experience some vectors may be of a different type beforehand (e.g. np.float64).

   :param cur: The current vector.
   :type cur: np.ndarray
   :param matrix: The transformation matrix.
   :type matrix: np.ndarray
   :param target_dtype: The target element data type.
                        Defaults to np.float32.
   :type target_dtype: Type

   :Returns: **np.ndarray** -- The transformed vector.
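
As a rough illustration of what ``convert_vec`` is described as doing, the sketch below assumes the conversion is a matrix-vector product followed by a cast to the target dtype. The source only states that the vector is transformed and that the result has a uniform dtype, so ``project_vector`` is a hypothetical stand-in, not the actual function body.

```python
import numpy as np


def project_vector(cur, matrix, target_dtype=np.float32):
    """Project a vector with the matrix and cast the result (assumed behaviour)."""
    return np.dot(matrix, cur).astype(target_dtype)


# A (2, 3) matrix that simply keeps the first two dimensions.
matrix = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
cur = np.array([0.5, -2.0, 3.0], dtype=np.float64)
out = project_vector(cur, matrix)
print(out.shape, out.dtype)
```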


.. py:function:: convert_vocab(vocab, matrix)

   Use the transformation matrix to convert the word vectors.

   :param vocab: The Vocab.
   :type vocab: Vocab
   :param matrix: The transformation matrix.
   :type matrix: np.ndarray


.. py:function:: convert_context_vectors(cdb, matrix)

   Use the transformation matrix to convert the context vectors within the
   CDB.

   :param cdb: The Concept Database.
   :type cdb: CDB
   :param matrix: The transformation matrix.
   :type matrix: np.ndarray


.. py:function:: convert_vocab_vector_size(cdb, vocab, vec_size)

   Convert the vocab vector size to a smaller one.

   This uses Principal Component Analysis (PCA). The idea is that we
   first center all the word vectors (in Vocab), then compute the
   covariance matrix, then find the eigenvalues and eigenvectors,
   and then we select the top `vec_size` eigenvectors.
   This produces a transformation matrix of shape (vec_size, N),
   where N is the current vector length in the vocab.

   After that, we perform the transformation. First we transform all
   the vectors in the Vocab. And then we transform all the context
   vectors defined within the CDB.

   NOTE: This requires the CDB as well since the per concept context
   vectors stored within it are based on the vectors in the vocab and
   thus they also need to be transformed.

   :param cdb: The Concept Database.
   :type cdb: CDB
   :param vocab: The Vocab.
   :type vocab: Vocab
   :param vec_size: The target vector size.
   :type vec_size: int
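
The note above about also transforming the CDB context vectors can be made concrete with plain dictionaries. This is a sketch under stated assumptions: ``word_vecs`` and ``context_vecs`` are hypothetical stand-ins for the vectors held by ``Vocab`` and ``CDB``, the context vector is assumed to be a simple sum of word vectors, and the matrix is an arbitrary orthonormal projection rather than one obtained through PCA.

```python
import numpy as np

rng = np.random.default_rng(1)
old_size, new_size = 8, 4

# Hypothetical stand-ins for Vocab word vectors and CDB context vectors.
word_vecs = {
    "kidney": rng.normal(size=old_size),
    "failure": rng.normal(size=old_size),
}
# Assumption: a context vector is built from the surrounding word vectors.
context_vecs = {"C1": word_vecs["kidney"] + word_vecs["failure"]}

# Any (new_size, old_size) matrix demonstrates the point; this one has
# orthonormal rows, obtained from a reduced QR decomposition.
q, _ = np.linalg.qr(rng.normal(size=(old_size, new_size)))
matrix = q.T

# Transform both collections with the same matrix.
word_vecs = {word: matrix @ vec for word, vec in word_vecs.items()}
context_vecs = {cui: matrix @ vec for cui, vec in context_vecs.items()}

# The vectors now share a dimensionality, so similarities are defined again.
similarity = float(np.dot(context_vecs["C1"], word_vecs["kidney"]))
print(context_vecs["C1"].shape, similarity)
```

Because the transformation is linear, a context vector built from word vectors keeps the same relationship to them after both are converted. Converting only the vocab would leave the context vectors with the old length and make the dot product fail.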