medcat.utils.legacy.convert_vocab
=================================

.. py:module:: medcat.utils.legacy.convert_vocab


Classes
-------

.. autoapisummary::

   medcat.utils.legacy.convert_vocab.Vocab


Functions
---------

.. autoapisummary::

   medcat.utils.legacy.convert_vocab.get_vocab_from_old


Module Contents
---------------

.. py:class:: Vocab

   Bases: :py:obj:`medcat.storage.serialisables.AbstractSerialisable`


   Vocabulary used to store word embeddings for context similarity
   calculation. Also used by the spell checker, though only to check
   whether a word is spelled correctly, not to fix the spelling.

   Properties:
       vocab (dict[str, WordDescriptor]):
           Map from word to attributes, e.g. {'house':
               {'vector': <np.array>, 'count': <int>, ...}, ...}
       index2word (dict[int, str]):
           Map from index to word - used for negative sampling
       vec_index2word (dict[int, str]):
           Same as index2word but only words that have vectors


   .. py:method:: __init__()


   .. py:attribute:: vocab
      :type:  dict[str, WordDescriptor]


   .. py:attribute:: index2word
      :type:  dict[int, str]


   .. py:attribute:: vec_index2word
      :type:  dict[int, str]


   .. py:attribute:: cum_probs
      :type:  numpy.ndarray


   .. py:method:: inc_or_add(word, cnt = 1, vec = None)

      Add a word or increase its count.

      :param word: Word to be added
      :type word: str
      :param cnt: How much to increase the count by, or the initial count
                  if the word is new. (Default value = 1)
      :type cnt: int
      :param vec: Word vector (Default value = None)
      :type vec: Optional[np.ndarray]


   .. py:method:: remove_all_vectors()

      Remove all stored vector representations.


   .. py:method:: remove_words_below_cnt(cnt)

      Remove all words with frequency below cnt.

      :param cnt: Word count limit.
      :type cnt: int
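
      The sketch below illustrates the kind of filtering this method
      describes, using plain dictionaries rather than the actual ``Vocab``
      internals. The helper name ``drop_rare_words``, the ``{'count': ...}``
      layout, and the strict "less than" comparison are assumptions made for
      illustration only.

      .. code-block:: python

         def drop_rare_words(vocab, cnt):
             """Keep only words whose count is at least cnt (assumed semantics)."""
             kept = {w: d for w, d in vocab.items() if d["count"] >= cnt}
             # Rebuild a contiguous index so no position points at a removed word.
             index2word = dict(enumerate(kept))
             return kept, index2word

         vocab = {"house": {"count": 12}, "hovse": {"count": 1}, "car": {"count": 5}}
         kept, index2word = drop_rare_words(vocab, cnt=5)

      Rebuilding the index after filtering mirrors the presence of
      ``_rebuild_index`` on the class, although how the two interact is not
      documented here.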


   .. py:method:: _rebuild_index()


   .. py:method:: inc_wc(word, cnt = 1)

      Increase word count by cnt.

      :param word: For which word to increase the count
      :type word: str
      :param cnt: By how much to increase the count (Default value = 1)
      :type cnt: int


   .. py:method:: add_vec(word, vec)

      Add vector to a word.

      :param word: To which word to add the vector.
      :type word: str
      :param vec: The vector to add.
      :type vec: np.ndarray


   .. py:method:: reset_counts(cnt = 1)

      Reset the count for all words to cnt.

      :param cnt: New count for all words in the vocab. (Default value = 1)
      :type cnt: int


   .. py:method:: update_counts(tokens)

      Given a list of tokens, update the counts for words in the vocab.

      :param tokens: Usually a large block of text split into tokens/words.
      :type tokens: list[str]
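
      A minimal sketch of this kind of update, using a plain ``dict`` of
      counts instead of a real ``Vocab``. The helper name
      ``update_counts_sketch`` is hypothetical, and skipping tokens that are
      not already in the vocabulary is an assumption based on the wording
      "words in the vocab".

      .. code-block:: python

         from collections import Counter

         def update_counts_sketch(counts, tokens):
             """Add token frequencies to counts, skipping unknown tokens."""
             for token, freq in Counter(tokens).items():
                 if token in counts:  # assumed: unknown tokens are ignored
                     counts[token] += freq
             return counts

         counts = {"house": 1, "car": 1}
         update_counts_sketch(counts, "the house by the other house".split())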


   .. py:method:: add_word(word, cnt = 1, vec = None, replace = True)

      Add a word to the vocabulary.

      :param word: The word to be added; it should be lemmatized and lowercased
      :type word: str
      :param cnt: Count of this word in your dataset (Default value = 1)
      :type cnt: int
      :param vec: The vector representation of the word (Default value = None)
      :type vec: Optional[np.ndarray]
      :param replace: Whether to replace an existing vector representation (Default value = True)
      :type replace: bool


   .. py:method:: add_words(path, replace = True)

      Add words to the vocab from a file. Each line of the file must have
      the following format (the vector is optional)::

          <word>      <cnt>[  <vec_space_separated>]

      For example, a line for the word "house" with a 3-dimensional vector::

          house   34444   0.3232 0.123213 1.231231

      :param path: path to the file with words and vectors
      :type path: str
      :param replace: Whether existing words in the vocabulary are replaced.
                      Defaults to True.
      :type replace: bool
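
      To make the line format concrete, here is a sketch of parsing a single
      line. The helper name ``parse_vocab_line`` is hypothetical, and the use
      of tabs between the word, count and vector fields is an assumption; only
      the space-separated vector is stated by the format description.

      .. code-block:: python

         def parse_vocab_line(line):
             """Split one line into (word, count, vector or None)."""
             parts = line.rstrip("\n").split("\t")  # assumed tab-separated fields
             word = parts[0]
             cnt = int(parts[1].strip())
             vec = None
             if len(parts) > 2 and parts[2].strip():
                 # Vector components are separated by spaces.
                 vec = [float(x) for x in parts[2].split()]
             return word, cnt, vec

         word, cnt, vec = parse_vocab_line("house\t34444\t0.3232 0.123213 1.231231\n")
         no_vec = parse_vocab_line("car\t7")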


   .. py:method:: init_cumsums()

      Initialise cumulative sums.

      This is in place of the unigram table. But similarly to it, this
      approach allows generating a list of indices that match the
      probabilistic distribution expected as per the word counts of each
      word.
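
      The general technique can be sketched with the standard library alone.
      This is not the ``Vocab`` implementation: the helper names
      ``build_cum_probs`` and ``sample_indices`` are hypothetical, and the
      probabilities are assumed to be directly proportional to the raw counts.

      .. code-block:: python

         import bisect
         import itertools
         import random

         def build_cum_probs(counts):
             """Return cumulative probabilities proportional to the counts."""
             total = sum(counts)
             return [c / total for c in itertools.accumulate(counts)]

         def sample_indices(cum_probs, n, rng=random):
             """Draw n indices; index i is drawn with probability counts[i] / total."""
             last = len(cum_probs) - 1
             # min() guards against floating-point rounding in the final entry.
             return [min(bisect.bisect_right(cum_probs, rng.random()), last)
                     for _ in range(n)]

         counts = [6, 3, 1]  # counts for three words, by index
         cum_probs = build_cum_probs(counts)
         samples = sample_indices(cum_probs, n=6, rng=random.Random(0))

      Compared with a unigram table, which repeats each index in proportion
      to its count, the cumulative array needs one entry per word and a binary
      search per draw.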


   .. py:method:: get_negative_samples(n = 6, ignore_punct_and_num = False)

      Get N negative samples.

      :param n: How many words to return (Default value = 6)
      :type n: int
      :param ignore_punct_and_num: Whether to ignore punctuation and numbers. Defaults to False.
      :type ignore_punct_and_num: bool

      :raises Exception: If no unigram table is present.

      :Returns: **list[int]** -- Indices for words in this vocabulary.


   .. py:method:: get_vectors(indices)


   .. py:method:: __getitem__(word)


   .. py:method:: vec(word)


   .. py:method:: count(word)


   .. py:method:: item(word)


   .. py:method:: __contains__(word)


   .. py:method:: __eq__(other)


   .. py:method:: save(save_path, serialiser = AvailableSerialisers.dill, overwrite = False)

      Save Vocab at path.

      :param save_path: The path to save at.
      :type save_path: str
      :param serialiser: The serialiser. Defaults to AvailableSerialisers.dill.
      :type serialiser: Union[str, AvailableSerialisers], optional
      :param overwrite: Whether to allow overwriting existing files. Defaults to False.
      :type overwrite: bool, optional


   .. py:method:: load(path)
      :classmethod:


   .. py:method:: get_strategy()


   .. py:method:: get_init_attrs()
      :classmethod:


   .. py:method:: ignore_attrs()
      :classmethod:


   .. py:method:: include_properties()
      :classmethod:


.. py:function:: get_vocab_from_old(old_path)

   Convert a v1 vocab file to a v2 Vocab.

   :param old_path: The v1 vocab file path.
   :type old_path: str

   :Returns: **Vocab** -- The v2 Vocab.
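
   A possible way to combine the conversion with ``Vocab.save``, shown as a
   sketch. The wrapper name ``convert_vocab`` and the example paths are
   hypothetical; only ``get_vocab_from_old`` and the ``save`` parameters come
   from the documentation above.

   .. code-block:: python

      def convert_vocab(old_path, new_path):
          """Load a v1 vocab file, convert it, and save the resulting v2 Vocab."""
          # Imported lazily so the helper can be defined without MedCAT installed.
          from medcat.utils.legacy.convert_vocab import get_vocab_from_old

          vocab = get_vocab_from_old(old_path)
          # overwrite=False is the documented default for Vocab.save.
          vocab.save(new_path, overwrite=False)
          return vocab

      # Hypothetical paths, for illustration only:
      # convert_vocab("models/v1/vocab.dat", "models/v2/vocab")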