medcat.vocab
============

.. py:module:: medcat.vocab


Attributes
----------

.. autoapisummary::

   medcat.vocab.WordDescriptor


Classes
-------

.. autoapisummary::

   medcat.vocab.AbstractSerialisable
   medcat.vocab.AvailableSerialisers
   medcat.vocab.Vocab


Functions
---------

.. autoapisummary::

   medcat.vocab.deserialise
   medcat.vocab.serialise


Module Contents
---------------

.. py:class:: AbstractSerialisable

   The abstract serialisable base class.

   This defines some common defaults.


   .. py:method:: get_strategy()


   .. py:method:: get_init_attrs()
      :classmethod:


   .. py:method:: ignore_attrs()
      :classmethod:


   .. py:method:: include_properties()
      :classmethod:


   .. py:method:: __eq__(other)


.. py:function:: deserialise(folder_path, ignore_folders_prefix = set(), ignore_folders_suffix = set(), **init_kwargs)

   Deserialise contents of a folder.

   Extra init keyword arguments can be provided if needed.
   These are generally:

   - cnf: The config relevant to the components
   - tokenizer (BaseTokenizer): The base tokenizer for the model
   - cdb (CDB): The CDB for the model
   - vocab (Vocab): The Vocab for the model
   - model_load_path (Optional[str]): The model load path,
     but not the component load path

   This method finds the serialiser to be used based on the files on disk.

   :param folder_path: The folder to deserialise.
   :type folder_path: str
   :param ignore_folders_prefix: The prefixes of folders to ignore.
   :type ignore_folders_prefix: set[str]
   :param ignore_folders_suffix: The suffixes of folders to ignore.
   :type ignore_folders_suffix: set[str]

   :Returns: **Serialisable** -- The deserialised object.


.. py:class:: AvailableSerialisers

   Bases: :py:obj:`enum.Enum`


   Describes the available serialisers.


   .. py:attribute:: dill


   .. py:attribute:: json


   .. py:method:: write_to(file_path)


   .. py:method:: from_file(file_path)
      :classmethod:


   .. py:method:: __new__(value)


   .. py:method:: _generate_next_value_(start, count, last_values)

      Generate the next value when not given.

      name: the name of the member
      start: the initial start value or None
      count: the number of existing members
      last_values: the list of previously assigned values


   .. py:method:: _missing_(value)
      :classmethod:


   .. py:method:: __repr__()


   .. py:method:: __str__()


   .. py:method:: __dir__()

      Returns all members and all public methods


   .. py:method:: __format__(format_spec)

      Returns format using actual value type unless __str__ has been overridden.


   .. py:method:: __hash__()


   .. py:method:: __reduce_ex__(proto)


   .. py:method:: name()

      The name of the Enum member.


   .. py:method:: value()

      The value of the Enum member.


.. py:function:: serialise(serialiser_type, obj, target_folder, overwrite = False)

   Serialise an object based on the specified serialiser type.

   :param serialiser_type: The serialiser type.
   :type serialiser_type: Union[str, AvailableSerialisers]
   :param obj: The object to serialise.
   :type obj: Serialisable
   :param target_folder: The folder to serialise into.
   :type target_folder: str
   :param overwrite: Whether to allow overwriting. Defaults to False.
   :type overwrite: bool


.. py:data:: WordDescriptor

.. py:class:: Vocab

   Bases: :py:obj:`medcat.storage.serialisables.AbstractSerialisable`


   Vocabulary used to store word embeddings for context similarity
   calculation. Also used by the spell checker, but only to check whether
   a word is spelled correctly, not to fix the spelling.

   Properties:
       vocab (dict[str, WordDescriptor]):
           Map from word to attributes, e.g. {'house':
               {'vector': <np.array>, 'count': <int>, ...}, ...}
       index2word (dict[int, str]):
           Map from index to word - used for negative sampling
       vec_index2word (dict[int, str]):
           Same as index2word but only words that have vectors
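
   As a rough illustration of how the three mappings relate, the sketch
   below builds them from plain dictionaries. It does not use MedCAT
   itself: the sample words, the plain-list vectors, and the assumption
   that both index maps share the same numbering are hypothetical.

   ```python
   # Hypothetical stand-in for Vocab.vocab: word -> descriptor.
   vocab = {
       "house": {"vector": [0.1, 0.2, 0.3], "count": 34444},
       "the": {"vector": None, "count": 99999},
       "cat": {"vector": [0.4, 0.5, 0.6], "count": 120},
   }

   # index -> word for every word in the vocabulary.
   index2word = dict(enumerate(vocab))

   # Same mapping, restricted to words that have a vector.
   vec_index2word = {
       idx: word
       for idx, word in index2word.items()
       if vocab[word]["vector"] is not None
   }

   print(index2word)
   print(vec_index2word)
   ```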


   .. py:method:: __init__()


   .. py:attribute:: vocab
      :type:  dict[str, WordDescriptor]


   .. py:attribute:: index2word
      :type:  dict[int, str]


   .. py:attribute:: vec_index2word
      :type:  dict[int, str]


   .. py:attribute:: cum_probs
      :type:  numpy.ndarray


   .. py:method:: inc_or_add(word, cnt = 1, vec = None)

      Add a word or increase its count.

      :param word: Word to be added
      :type word: str
      :param cnt: By how much should the count be increased, or to what
                  should it be set if a new word. (Default value = 1)
      :type cnt: int
      :param vec: Word vector (Default value = None)
      :type vec: Optional[np.ndarray]


   .. py:method:: remove_all_vectors()

      Remove all stored vector representations.


   .. py:method:: remove_words_below_cnt(cnt)

      Remove all words with frequency below cnt.

      :param cnt: Word count limit.
      :type cnt: int
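
      The idea can be sketched on a plain dictionary, assuming that
      "below" means strictly less than ``cnt`` and that the index is
      renumbered afterwards. The helper name ``remove_below`` and the
      sample data are hypothetical and not part of MedCAT.

      ```python
      def remove_below(vocab, cnt):
          """Keep words whose count is at least cnt, then rebuild the index."""
          kept = {w: d for w, d in vocab.items() if d["count"] >= cnt}
          # Renumber the surviving words from zero (assumed behaviour).
          index2word = dict(enumerate(kept))
          return kept, index2word


      vocab = {
          "house": {"count": 34444},
          "hte": {"count": 2},
          "cat": {"count": 120},
      }
      kept, index2word = remove_below(vocab, cnt=10)
      print(kept)
      print(index2word)
      ```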


   .. py:method:: _rebuild_index()


   .. py:method:: inc_wc(word, cnt = 1)

      Increase word count by cnt.

      :param word: For which word to increase the count
      :type word: str
      :param cnt: By how much to increase the count (Default value = 1)
      :type cnt: int


   .. py:method:: add_vec(word, vec)

      Add vector to a word.

      :param word: To which word to add the vector.
      :type word: str
      :param vec: The vector to add.
      :type vec: np.ndarray


   .. py:method:: reset_counts(cnt = 1)

      Reset the count for all words to cnt.

      :param cnt: New count for all words in the vocab. (Default value = 1)
      :type cnt: int


   .. py:method:: update_counts(tokens)

      Given a list of tokens, update counts for words in the vocab.

      :param tokens: Usually a large block of text split into tokens/words.
      :type tokens: list[str]
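
      A minimal sketch of this kind of update, using a plain
      ``dict[str, int]`` in place of the vocabulary. It assumes that
      tokens not already in the vocabulary are skipped; the helper name
      ``update_counts_sketch`` is hypothetical.

      ```python
      from collections import Counter


      def update_counts_sketch(counts, tokens):
          """Add token frequencies to counts, for known words only."""
          for word, n in Counter(tokens).items():
              if word in counts:  # assumed: unknown tokens are ignored
                  counts[word] += n
          return counts


      counts = {"house": 1, "cat": 1}
      tokens = "the cat sat in the house with another cat".split()
      update_counts_sketch(counts, tokens)
      print(counts)
      ```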


   .. py:method:: add_word(word, cnt = 1, vec = None, replace = True)

      Add a word to the vocabulary.

      :param word: The word to be added; it should be lemmatized and lowercased.
      :type word: str
      :param cnt: Count of this word in your dataset (Default value = 1)
      :type cnt: int
      :param vec: The vector representation of the word (Default value = None)
      :type vec: Optional[np.ndarray]
      :param replace: Whether to replace an existing vector representation (Default value = True)
      :type replace: bool


   .. py:method:: add_words(path, replace = True)

      Add words to the vocab from a file. The file
      is required to have the following format (vec being optional)::

          <word>      <cnt>[  <vec_space_separated>]

      For example, one line for the word house with a 3-dimensional vector::

          house   34444   0.3232 0.123213 1.231231

      :param path: path to the file with words and vectors
      :type path: str
      :param replace: Whether existing words in the vocabulary will be
                      replaced. Defaults to True.
      :type replace: bool
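
      To make the line format concrete, here is a small parser sketch.
      It assumes the three fields are separated by tabs and the vector
      components by spaces; the separator choice and the helper name
      ``parse_vocab_line`` are assumptions, not MedCAT's implementation.

      ```python
      def parse_vocab_line(line):
          """Split one line into (word, count, vector-or-None)."""
          parts = line.rstrip("\n").split("\t")
          word = parts[0]
          cnt = int(parts[1])
          vec = None
          # The vector field is optional.
          if len(parts) > 2 and parts[2].strip():
              vec = [float(x) for x in parts[2].split()]
          return word, cnt, vec


      print(parse_vocab_line("house\t34444\t0.3232 0.123213 1.231231\n"))
      print(parse_vocab_line("the\t99999\n"))
      ```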


   .. py:method:: init_cumsums()

      Initialise cumulative sums.

      This replaces the unigram table. Like the unigram table, it allows
      generating a list of indices that follows the probability
      distribution implied by the word counts.
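
      The general technique can be sketched with the standard library
      alone: normalise the counts, accumulate them, and map a uniform
      random number to an index by binary search. The sample counts and
      the helper ``sample_indices`` are hypothetical, and MedCAT may
      weight or store the values differently.

      ```python
      import bisect
      import random
      from itertools import accumulate

      counts = {"house": 50, "cat": 30, "dog": 20}
      words = list(counts)
      total = sum(counts.values())

      # Cumulative probabilities; the last entry is (approximately) 1.0.
      cum_probs = list(accumulate(c / total for c in counts.values()))


      def sample_indices(n, rng):
          """Draw n word indices in proportion to the word counts."""
          last = len(words) - 1
          # min() guards against floating-point round-off in the last bin.
          return [
              min(bisect.bisect_right(cum_probs, rng.random()), last)
              for _ in range(n)
          ]


      rng = random.Random(0)
      samples = sample_indices(6, rng)
      print([words[i] for i in samples])
      ```

      Compared with a unigram table, this stores one number per word
      instead of one table slot per unit of probability mass, at the
      cost of a binary search per sample.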


   .. py:method:: get_negative_samples(n = 6, ignore_punct_and_num = False)

      Get N negative samples.

      :param n: How many words to return (Default value = 6)
      :type n: int
      :param ignore_punct_and_num: Whether to ignore punctuation and numbers. Defaults to False.
      :type ignore_punct_and_num: bool

      :raises Exception: If no unigram table is present.

      :Returns: **list[int]** -- Indices for words in this vocabulary.


   .. py:method:: get_vectors(indices)


   .. py:method:: __getitem__(word)


   .. py:method:: vec(word)


   .. py:method:: count(word)


   .. py:method:: item(word)


   .. py:method:: __contains__(word)


   .. py:method:: __eq__(other)


   .. py:method:: save(save_path, serialiser = AvailableSerialisers.dill, overwrite = False)

      Save Vocab at path.

      :param save_path: The path to save at.
      :type save_path: str
      :param serialiser: The serialiser. Defaults to AvailableSerialisers.dill.
      :type serialiser: Union[str, AvailableSerialisers], optional
      :param overwrite: Whether to allow overwriting existing files. Defaults to False.
      :type overwrite: bool, optional


   .. py:method:: load(path)
      :classmethod:


   .. py:method:: get_strategy()


   .. py:method:: get_init_attrs()
      :classmethod:


   .. py:method:: ignore_attrs()
      :classmethod:


   .. py:method:: include_properties()
      :classmethod: