medcat.components.ner.dict_based_ner
====================================

.. py:module:: medcat.components.ner.dict_based_ner


Attributes
----------

.. autoapisummary::

   medcat.components.ner.dict_based_ner._EXTRA_NAME
   medcat.components.ner.dict_based_ner.logger


Classes
-------

.. autoapisummary::

   medcat.components.ner.dict_based_ner.MutableDocument
   medcat.components.ner.dict_based_ner.CoreComponentType
   medcat.components.ner.dict_based_ner.AbstractCoreComponent
   medcat.components.ner.dict_based_ner.BaseTokenizer
   medcat.components.ner.dict_based_ner.Vocab
   medcat.components.ner.dict_based_ner.CDB
   medcat.components.ner.dict_based_ner.NER


Functions
---------

.. autoapisummary::

   medcat.components.ner.dict_based_ner.maybe_annotate_name
   medcat.components.ner.dict_based_ner.ensure_optional_extras_installed


Module Contents
---------------

.. py:class:: MutableDocument

   Bases: :py:obj:`Protocol`


   The mutable parts of the document.

   Represents parts of the document that can / should be changed
   by the various components.


   .. py:property:: base
      :type: BaseDocument


      The base document.


   .. py:property:: linked_ents
      :type: list[MutableEntity]


      The linked entities associated with the document.

      This should be set by the linker.


   .. py:property:: ner_ents
      :type: list[MutableEntity]


      All entities recognised by NER.

      This should be set by the NER component.


   .. py:method:: __iter__()


   .. py:method:: __getitem__(index: int) -> MutableToken
                  __getitem__(index: slice) -> MutableEntity


   .. py:method:: __len__()


   .. py:method:: get_tokens(start_index, end_index)

      Get the tokens that span the specified character indices.

      :param start_index: The starting character index.
      :type start_index: int
      :param end_index: The ending character index.
      :type end_index: int

      :Returns: **list[MutableToken]** -- The list of tokens.


   .. py:method:: set_addon_data(path, val)

      Used to add arbitrary data to the document.

      This is generally used by addons to keep track of their data.

      NB! The path used needs to be registered using the
      `register_addon_path` class method.

      :param path: The data ID / path.
      :type path: str
      :param val: The value to be added.
      :type val: Any


   .. py:method:: has_addon_data(path)

      Checks whether the addon data for a specific path has been set.

      :param path: The path to check.
      :type path: str

      :Returns: **bool** -- Whether the addon data has been set.


   .. py:method:: get_addon_data(path)

      Get data added to the document.

      See `set_addon_data` for details.

      :param path: The data ID / path.
      :type path: str

      :Returns: **Any** -- The stored value.


   .. py:method:: get_available_addon_paths()

      Gets the available addon data paths for this document.

      This will only include paths that have values set.

      :Returns: **list[str]** -- List of available addon data paths.


   .. py:method:: register_addon_path(path, def_val = None, force = True)
      :classmethod:


      Register a custom/arbitrary data path.

      This can be used to store arbitrary data along with the entity for
      use in an addon (e.g. MetaCAT).

      Note: when using this, namespace each path to the component that
      owns it in order to avoid conflicts.

      :param path: The path to be used. Should be prefixed by the component
                   name (e.g. `meta_cat_id` for an ID tied to the `meta_cat` addon).
      :type path: str
      :param def_val: Default value. Defaults to `None`.
      :type def_val: Any
      :param force: Whether to forcefully add the value.
                    Defaults to True.
      :type force: bool


   .. py:attribute:: __slots__
      :value: ()


   .. py:attribute:: _is_protocol
      :value: True


   .. py:attribute:: _is_runtime_protocol
      :value: False


   .. py:method:: __init_subclass__(*args, **kwargs)
      :classmethod:


   .. py:method:: __class_getitem__(params)
      :classmethod:

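The addon data methods above can be illustrated with a minimal sketch. `SketchDocument` is a hypothetical stand-in and not MedCAT's implementation; in particular, the behaviour of `force` and the fallback to the registered default are assumptions made for illustration.

```python
from typing import Any


class SketchDocument:
    """Hypothetical stand-in for a MutableDocument implementation."""

    # Registered paths and their default values, shared by all instances.
    _addon_defaults: dict[str, Any] = {}

    def __init__(self, text: str) -> None:
        self.text = text
        self._addon_data: dict[str, Any] = {}

    @classmethod
    def register_addon_path(cls, path: str, def_val: Any = None,
                            force: bool = True) -> None:
        # Assumption: without force, re-registering a path is an error.
        if path in cls._addon_defaults and not force:
            raise ValueError(f"Path already registered: {path}")
        cls._addon_defaults[path] = def_val

    def set_addon_data(self, path: str, val: Any) -> None:
        # Only registered paths may be written to.
        if path not in self._addon_defaults:
            raise KeyError(f"Unregistered addon path: {path}")
        self._addon_data[path] = val

    def has_addon_data(self, path: str) -> bool:
        return path in self._addon_data

    def get_addon_data(self, path: str) -> Any:
        # Assumption: fall back to the registered default if nothing was set.
        return self._addon_data.get(path, self._addon_defaults[path])

    def get_available_addon_paths(self) -> list[str]:
        # Only paths that actually have a value on this document.
        return list(self._addon_data)


# Namespaced path, following the `meta_cat_id` example above.
SketchDocument.register_addon_path("meta_cat_id", def_val=-1)
doc = SketchDocument("Patient has diabetes.")
before = doc.has_addon_data("meta_cat_id")
doc.set_addon_data("meta_cat_id", 7)
```

The path is registered once on the class, while values are stored per document, which is why `register_addon_path` is a class method and the other methods are not.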

.. py:class:: CoreComponentType

   Bases: :py:obj:`enum.Enum`


   Generic enumeration.

   Derive from this class to define new enumerations.


   .. py:attribute:: tagging


   .. py:attribute:: token_normalizing


   .. py:attribute:: ner


   .. py:attribute:: linking


   .. py:method:: __new__(value)


   .. py:method:: _generate_next_value_(start, count, last_values)

      Generate the next value when not given.

      name: the name of the member
      start: the initial start value or None
      count: the number of existing members
      last_value: the last value assigned or None


   .. py:method:: _missing_(value)
      :classmethod:


   .. py:method:: __repr__()


   .. py:method:: __str__()


   .. py:method:: __dir__()

      Returns all members and all public methods


   .. py:method:: __format__(format_spec)

      Returns format using actual value type unless __str__ has been overridden.


   .. py:method:: __hash__()


   .. py:method:: __reduce_ex__(proto)


   .. py:method:: name()

      The name of the Enum member.


   .. py:method:: value()

      The value of the Enum member.

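As a rough illustration of how the members listed above behave, the sketch below defines a local enum with the same member names. `SketchComponentType` is hypothetical, and its values are placeholders generated with `auto()`; the actual values of `CoreComponentType` are not shown in this listing.

```python
from enum import Enum, auto


class SketchComponentType(Enum):
    # Member names mirror the listing above; the values are placeholders.
    tagging = auto()
    token_normalizing = auto()
    ner = auto()
    linking = auto()


# Members iterate in definition order and can be looked up by name.
names = [member.name for member in SketchComponentType]
ner_type = SketchComponentType["ner"]
```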

.. py:class:: AbstractCoreComponent

   Bases: :py:obj:`CoreComponent`


   Base class for protocol classes.

   Protocol classes are defined as::

       class Proto(Protocol):
           def meth(self) -> int:
               ...

   Such classes are primarily used with static type checkers that recognize
   structural subtyping (static duck-typing), for example::

       class C:
           def meth(self) -> int:
               return 0

       def func(x: Proto) -> int:
           return x.meth()

       func(C())  # Passes static type check

   See PEP 544 for details. Protocol classes decorated with
   @typing.runtime_checkable act as simple-minded runtime protocols that check
   only the presence of given attributes, ignoring their type signatures.
   Protocol classes can be generic, they are defined as::

       class GenProto(Protocol[T]):
           def meth(self) -> T:
               ...


   .. py:attribute:: NAME_PREFIX
      :value: 'core_'


   .. py:property:: full_name
      :type: str


      Name with the component type (e.g. ner, linking, meta).


   .. py:method:: is_core()

      Whether the component is a core component or not.

      :Returns: **bool** -- Whether this is a core component.


   .. py:method:: get_type()


   .. py:property:: name
      :type: str


      The name of the component.


   .. py:method:: __call__(doc)


   .. py:method:: get_init_args(tokenizer, cdb, vocab, model_load_path)
      :classmethod:


      Get the init arguments for the component.

      :param tokenizer: The tokenizer.
      :type tokenizer: BaseTokenizer
      :param cdb: The CDB.
      :type cdb: CDB
      :param vocab: The Vocab.
      :type vocab: Vocab
      :param model_load_path: The model load path (or None).
      :type model_load_path: Optional[str]

      :Returns: **list[Any]** -- The list of init arguments.


   .. py:method:: get_init_kwargs(tokenizer, cdb, vocab, model_load_path)
      :classmethod:


      Get init keyword arguments for the component.

      :param tokenizer: The tokenizer.
      :type tokenizer: BaseTokenizer
      :param cdb: The CDB.
      :type cdb: CDB
      :param vocab: The Vocab.
      :type vocab: Vocab
      :param model_load_path: The model load path (or None).
      :type model_load_path: Optional[str]

      :Returns: **dict[str, Any]** -- The keyword arguments.


   .. py:attribute:: __slots__
      :value: ()


   .. py:attribute:: _is_protocol
      :value: True


   .. py:attribute:: _is_runtime_protocol
      :value: False


   .. py:method:: __init_subclass__(*args, **kwargs)
      :classmethod:


   .. py:method:: __class_getitem__(params)
      :classmethod:

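The `get_init_args` / `get_init_kwargs` pair lets a generic factory build a component without knowing its constructor signature. The sketch below shows that pattern under stated assumptions: `SketchNER` and `create_component` are hypothetical names, and returning `[tokenizer, cdb]` is only a guess based on the `NER(tokenizer, cdb)` signature in this module.

```python
from typing import Any, Optional


class SketchNER:
    """Hypothetical component whose constructor takes (tokenizer, cdb)."""

    def __init__(self, tokenizer: Any, cdb: Any) -> None:
        self.tokenizer = tokenizer
        self.cdb = cdb

    @classmethod
    def get_init_args(cls, tokenizer: Any, cdb: Any, vocab: Any,
                      model_load_path: Optional[str]) -> list[Any]:
        # Pick only what __init__ needs; vocab and the path are unused here.
        return [tokenizer, cdb]

    @classmethod
    def get_init_kwargs(cls, tokenizer: Any, cdb: Any, vocab: Any,
                        model_load_path: Optional[str]) -> dict[str, Any]:
        # This sketch needs no keyword arguments.
        return {}


def create_component(comp_cls: type, tokenizer: Any, cdb: Any, vocab: Any,
                     model_load_path: Optional[str] = None) -> Any:
    # The factory passes everything it has; the class selects what it uses.
    args = comp_cls.get_init_args(tokenizer, cdb, vocab, model_load_path)
    kwargs = comp_cls.get_init_kwargs(tokenizer, cdb, vocab, model_load_path)
    return comp_cls(*args, **kwargs)


# Strings stand in for the real tokenizer, CDB and Vocab objects.
component = create_component(SketchNER, "tokenizer", "cdb", "vocab")
```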

.. py:function:: maybe_annotate_name(tokenizer, name, tkns, doc, cdb, config, label = 'concept')

   Check whether the given name should be annotated, based on the config
   rules. If so, the annotation is added to the doc.entities array.

   :param tokenizer: The tokenizer (probably spaCy).
   :type tokenizer: BaseTokenizer
   :param name: The name found in the text of the document.
   :type name: str
   :param tkns: Tokens that belong to this name in the spaCy document.
   :type tkns: list[MutableToken]
   :param doc: spaCy document to be annotated with named entities.
   :type doc: BaseDocument
   :param cdb: Concept database.
   :type cdb: CDB
   :param config: Global config for medcat.
   :type config: Config
   :param label: Label for this name (usually `concept` if we are using
                 a vocab based approach).
   :type label: str

   :Returns: **Optional[BaseEntity]** -- The entity, if relevant.

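The "check, then annotate" flow described for `maybe_annotate_name` can be sketched as follows. All names here (`SketchEntity`, `SketchDoc`, `maybe_annotate`) are hypothetical, and the two rules used (the name must be known and must meet a minimum length) are assumptions standing in for the real config rules, which are not listed in this reference.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SketchEntity:
    text: str
    label: str


@dataclass
class SketchDoc:
    ner_ents: list[SketchEntity] = field(default_factory=list)


def maybe_annotate(name: str, tokens: list[str], doc: SketchDoc,
                   known_names: set[str], min_name_len: int = 3,
                   label: str = "concept") -> Optional[SketchEntity]:
    # Hypothetical rules: the name must be known and long enough.
    if name not in known_names or len(name) < min_name_len:
        return None
    # The rules passed, so build the entity and attach it to the document.
    entity = SketchEntity(text=" ".join(tokens), label=label)
    doc.ner_ents.append(entity)
    return entity


doc = SketchDoc()
known = {"kidney failure", "of"}
kept = maybe_annotate("kidney failure", ["kidney", "failure"], doc, known)
skipped = maybe_annotate("of", ["of"], doc, known)  # too short: not annotated
```

Returning `None` for a rejected name mirrors the `Optional[BaseEntity]` return type above.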

.. py:function:: ensure_optional_extras_installed(package_name, extra_name)

   Ensure that an optional dependency set is installed.

   :param package_name: The base package name.
   :type package_name: str
   :param extra_name: The name of the extra dependency.
   :type extra_name: str

   :raises MissingDependenciesError: If the extra dependency isn't provided.

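One way such a check could work is sketched below. This is not MedCAT's implementation: `ensure_modules_for_extra` is a hypothetical helper that takes an explicit list of module names, `MissingDependenciesError` is redefined locally as a stand-in, and the `medcat` package name in the hint is an assumption. Only the `dict-ner` extra name comes from `_EXTRA_NAME` in this module.

```python
import importlib.util


class MissingDependenciesError(Exception):
    """Local stand-in for the error type named above."""


def ensure_modules_for_extra(package_name: str, extra_name: str,
                             required_modules: list[str]) -> None:
    # find_spec returns None for a top-level module that cannot be imported.
    missing = [mod for mod in required_modules
               if importlib.util.find_spec(mod) is None]
    if missing:
        raise MissingDependenciesError(
            f"Missing {missing}; try: pip install "
            f"'{package_name}[{extra_name}]'")


# A standard-library module is always importable, so this passes silently.
ensure_modules_for_extra("medcat", "dict-ner", ["json"])

try:
    ensure_modules_for_extra("medcat", "dict-ner", ["no_such_module_xyz"])
    message = ""
except MissingDependenciesError as err:
    message = str(err)
```

Failing early with an install hint is friendlier than letting an `ImportError` surface later from deep inside the component.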

.. py:class:: BaseTokenizer

   Bases: :py:obj:`Protocol`


   The base tokenizer protocol.


   .. py:method:: create_entity(doc, token_start_index, token_end_index, label)

      Create an entity from a document.

      :param doc: The document to use.
      :type doc: MutableDocument
      :param token_start_index: The token start index.
      :type token_start_index: int
      :param token_end_index: The token end index.
      :type token_end_index: int
      :param label: The label.
      :type label: str

      :Returns: **MutableEntity** -- The resulting entity.


   .. py:method:: entity_from_tokens(tokens)

      Get an entity from the list of tokens.

      :param tokens: List of tokens.
      :type tokens: list[MutableToken]

      :Returns: **MutableEntity** -- The resulting entity.


   .. py:method:: __call__(text)


   .. py:method:: get_init_args(config)
      :classmethod:


   .. py:method:: get_init_kwargs(config)
      :classmethod:


   .. py:method:: get_doc_class()

      Get the document implementation class used by the tokenizer.

      This can be used, e.g., to register addon paths.

      :Returns: **Type[MutableDocument]** -- The document class.


   .. py:method:: get_entity_class()

      Get the entity implementation class used by the tokenizer.

      :Returns: **Type[MutableEntity]** -- The entity class.


   .. py:attribute:: __slots__
      :value: ()


   .. py:attribute:: _is_protocol
      :value: True


   .. py:attribute:: _is_runtime_protocol
      :value: False


   .. py:method:: __init_subclass__(*args, **kwargs)
      :classmethod:


   .. py:method:: __class_getitem__(params)
      :classmethod:

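To make the protocol more concrete, here is a toy tokenizer exposing the same method names. `WhitespaceTokenizer`, `SketchToken` and `SketchEntity` are hypothetical and far simpler than any real tokenizer; treating `token_end_index` as exclusive and defaulting the label to `concept` are assumptions.

```python
from dataclasses import dataclass


@dataclass
class SketchToken:
    text: str
    index: int


@dataclass
class SketchEntity:
    tokens: list[SketchToken]
    label: str

    @property
    def text(self) -> str:
        return " ".join(tkn.text for tkn in self.tokens)


class WhitespaceTokenizer:
    """Toy tokenizer; not one of MedCAT's tokenizers."""

    def __call__(self, text: str) -> list[SketchToken]:
        # The "document" here is just a list of tokens.
        return [SketchToken(word, i) for i, word in enumerate(text.split())]

    def create_entity(self, doc: list[SketchToken], token_start_index: int,
                      token_end_index: int, label: str) -> SketchEntity:
        # Assumption: the end index is exclusive, as in Python slicing.
        return SketchEntity(doc[token_start_index:token_end_index], label)

    def entity_from_tokens(self, tokens: list[SketchToken]) -> SketchEntity:
        # Assumption: entities built from tokens get a default label.
        return SketchEntity(list(tokens), "concept")


tokenizer = WhitespaceTokenizer()
doc = tokenizer("chronic kidney failure noted")
entity = tokenizer.create_entity(doc, 1, 3, "concept")
```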

.. py:class:: Vocab

   Bases: :py:obj:`medcat.storage.serialisables.AbstractSerialisable`


   Vocabulary used to store word embeddings for context similarity
   calculation. Also used by the spell checker, though only to check
   whether a word is spelled correctly, not to fix the spelling.

   Properties:
       vocab (dict[str, WordDescriptor]):
           Map from word to attributes, e.g. {'house':
               {'vector': <np.array>, 'count': <int>, ...}, ...}
       index2word (dict[int, str]):
           Map from an index to a word - used for negative sampling
       vec_index2word (dict):
           Same as index2word but only words that have vectors


   .. py:method:: __init__()


   .. py:attribute:: vocab
      :type:  dict[str, WordDescriptor]


   .. py:attribute:: index2word
      :type:  dict[int, str]


   .. py:attribute:: vec_index2word
      :type:  dict[int, str]


   .. py:attribute:: cum_probs
      :type:  numpy.ndarray


   .. py:method:: inc_or_add(word, cnt = 1, vec = None)

      Add a word or increase its count.

      :param word: Word to be added
      :type word: str
      :param cnt: By how much should the count be increased, or to what
                  should it be set if a new word. (Default value = 1)
      :type cnt: int
      :param vec: Word vector (Default value = None)
      :type vec: Optional[np.ndarray]


   .. py:method:: remove_all_vectors()

      Remove all stored vector representations.


   .. py:method:: remove_words_below_cnt(cnt)

      Remove all words with frequency below cnt.

      :param cnt: Word count limit.
      :type cnt: int


   .. py:method:: _rebuild_index()


   .. py:method:: inc_wc(word, cnt = 1)

      Increase word count by cnt.

      :param word: For which word to increase the count
      :type word: str
      :param cnt: By how much to increase the count (Default value = 1)
      :type cnt: int


   .. py:method:: add_vec(word, vec)

      Add vector to a word.

      :param word: To which word to add the vector.
      :type word: str
      :param vec: The vector to add.
      :type vec: np.ndarray


   .. py:method:: reset_counts(cnt = 1)

      Reset the count for all words to cnt.

      :param cnt: New count for all words in the vocab. (Default value = 1)
      :type cnt: int


   .. py:method:: update_counts(tokens)

      Given a list of tokens update counts for words in the vocab.

      :param tokens: Usually a large block of text split into tokens/words.
      :type tokens: list[str]


   .. py:method:: add_word(word, cnt = 1, vec = None, replace = True)

      Add a word to the vocabulary.

      :param word: The word to be added; it should be lemmatized and lowercased.
      :type word: str
      :param cnt: Count of this word in your dataset (Default value = 1)
      :type cnt: int
      :param vec: The vector representation of the word (Default value = None)
      :type vec: Optional[np.ndarray]
      :param replace: Will replace old vector representation (Default value = True)
      :type replace: bool


   .. py:method:: add_words(path, replace = True)

      Add words to the vocab from a file. Each line of the file must have
      the following format (the vector is optional)::

          <word>      <cnt>[  <vec_space_separated>]

      For example, a line for the word "house" with a 3-dimensional vector::

          house   34444   0.3232 0.123213 1.231231

      :param path: Path to the file with words and vectors.
      :type path: str
      :param replace: Whether existing words in the vocabulary will be replaced.
                      Defaults to True.
      :type replace: bool


   .. py:method:: init_cumsums()

      Initialise cumulative sums.

      This is in place of the unigram table. But similarly to it, this
      approach allows generating a list of indices that match the
      probabilistic distribution expected as per the word counts of each
      word.


   .. py:method:: get_negative_samples(n = 6, ignore_punct_and_num = False)

      Get N negative samples.

      :param n: How many words to return (Default value = 6)
      :type n: int
      :param ignore_punct_and_num: Whether to ignore punctuation and numbers. Defaults to False.
      :type ignore_punct_and_num: bool

      :raises Exception: If no unigram table is present.

      :Returns: **list[int]** -- Indices for words in this vocabulary.


   .. py:method:: get_vectors(indices)


   .. py:method:: __getitem__(word)


   .. py:method:: vec(word)


   .. py:method:: count(word)


   .. py:method:: item(word)


   .. py:method:: __contains__(word)


   .. py:method:: __eq__(other)


   .. py:method:: save(save_path, serialiser = AvailableSerialisers.dill, overwrite = False)

      Save Vocab at path.

      :param save_path: The path to save at.
      :type save_path: str
      :param serialiser: The serialiser. Defaults to AvailableSerialisers.dill.
      :type serialiser: Union[ str, AvailableSerialisers], optional
      :param overwrite: Whether to allow overwriting existing files. Defaults to False.
      :type overwrite: bool, optional


   .. py:method:: load(path)
      :classmethod:


   .. py:method:: get_strategy()


   .. py:method:: get_init_attrs()
      :classmethod:


   .. py:method:: ignore_attrs()
      :classmethod:


   .. py:method:: include_properties()
      :classmethod:

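The file format accepted by `add_words` and the cumulative-sum idea behind `init_cumsums` and `get_negative_samples` can be sketched with the standard library alone. This is not the `Vocab` implementation: the helper names are hypothetical, plain lists replace NumPy arrays, and sampling proportionally to the raw counts is an assumption (the real weighting is not specified here).

```python
import bisect
import itertools
import random


def parse_vocab_line(line: str) -> tuple[str, int, list[float]]:
    # Format from the docstring: <word> <cnt>[ <vec_space_separated>]
    parts = line.split()
    word, cnt = parts[0], int(parts[1])
    vec = [float(x) for x in parts[2:]]  # empty when no vector is given
    return word, cnt, vec


def build_cum_probs(counts: list[int]) -> list[float]:
    # Assumption: sampling probability is proportional to the raw count.
    total = sum(counts)
    return list(itertools.accumulate(cnt / total for cnt in counts))


def sample_indices(cum_probs: list[float], n: int,
                   rng: random.Random) -> list[int]:
    # Draw u in [0, 1) and locate it among the cumulative probabilities.
    last = len(cum_probs) - 1
    return [min(bisect.bisect_right(cum_probs, rng.random()), last)
            for _ in range(n)]


lines = ["house 6 0.3 0.1 1.2", "tree 3", "car 1"]
parsed = [parse_vocab_line(line) for line in lines]
cum_probs = build_cum_probs([cnt for _, cnt, _ in parsed])
samples = sample_indices(cum_probs, n=6, rng=random.Random(0))
```

Each sample costs one binary search over the cumulative array, so memory grows with the vocabulary size rather than with the size of a precomputed unigram table.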

.. py:class:: CDB(config)

   Bases: :py:obj:`medcat.storage.serialisables.AbstractSerialisable`


   The abstract serialisable base class.

   This defines some common defaults.


   .. py:method:: __init__(config)


   .. py:attribute:: config


   .. py:attribute:: cui2info
      :type:  dict[str, medcat.cdb.concepts.CUIInfo]


   .. py:attribute:: name2info
      :type:  dict[str, medcat.cdb.concepts.NameInfo]


   .. py:attribute:: type_id2info
      :type:  dict[str, medcat.cdb.concepts.TypeInfo]


   .. py:attribute:: token_counts
      :type:  dict[str, int]


   .. py:attribute:: addl_info
      :type:  dict[str, Any]


   .. py:attribute:: _subnames
      :type:  set[str]


   .. py:attribute:: is_dirty
      :value: False


   .. py:attribute:: has_changed_names
      :value: False


   .. py:method:: get_init_attrs()
      :classmethod:


   .. py:method:: _reset_subnames()


   .. py:method:: has_subname(name)

      Whether the CDB has the specified subname.

      :param name: The subname to check.
      :type name: str

      :Returns: **bool** -- Whether the subname is present in this CDB.


   .. py:method:: get_name(cui)

      Returns preferred name if it exists, otherwise it will return
      the longest name assigned to the concept.

      :param cui: Concept ID or unique identifier in this database.
      :type cui: str

      :Returns: **str** -- The name of the concept.


   .. py:method:: weighted_average_function(step)

      Get the weighted average for step.

      :param step: The step.
      :type step: int

      :Returns: **float** -- The weighted average.


   .. py:method:: add_types(types)

      Add type info to CDB.

      :param types: The raw type info.
      :type types: Iterable[tuple[str, str]]


   .. py:method:: add_names(cui, names, name_status = ST.AUTOMATIC, full_build = False)

      Adds a name to an existing concept.

      :param cui: Concept ID or unique identifier in this database, all concepts
                  that have the same CUI will be merged internally.
      :type cui: str
      :param names: Names for this concept, or the value that if found in free
                    text can be linked to this concept. Names is a dict like:
                    `{name: {'tokens': tokens, 'snames': snames,
                             'raw_name': raw_name}, ...}`
                    Names should be generated by helper function
                    'medcat.preprocessing.cleaners.prepare_name'
      :type names: dict[str, NameDescriptor]
      :param name_status: One of `P`, `N`, `A`. Defaults to 'A'.
      :type name_status: str
      :param full_build: If True, the dictionary self.addl_info will also be
                         populated. It contains a lot of extra information about
                         concepts, but can consume a lot of memory. This is not
                         necessary for normal functioning of MedCAT
                         (Default value `False`).
      :type full_build: bool


   .. py:method:: _add_concept_names(cui, names, name_status)


   .. py:method:: _add_full_build(cui, names, ontologies, description, type_ids)


   .. py:method:: _add_concept(cui, names, ontologies, name_status, type_ids, description, full_build = False)

      Add a concept to internal Concept Database (CDB). Depending on what
      you are providing this will add a large number of properties for each
      concept.

      :param cui: Concept ID or unique identifier in this database, all concepts
                  that have the same CUI will be merged internally.
      :type cui: str
      :param names: Names for this concept, or the value that if found in free
                    text can be linked to this concept. Names is a dict like:
                    `{name: {'tokens': tokens, 'snames': snames,
                             'raw_name': raw_name}, ...}`
                    Names should be generated by helper function
                    'medcat.preprocessing.cleaners.prepare_name'
      :type names: dict[str, NameDescriptor]
      :param ontologies: ontologies in which the concept exists (e.g. SNOMEDCT, HPO)
      :type ontologies: set[str]
      :param name_status: One of `P`, `N`, `A`
      :type name_status: str
      :param type_ids: Semantic type identifier (have a look at TUIs in UMLS or
                       SNOMED-CT)
      :type type_ids: set[str]
      :param description: Description of this concept.
      :type description: str
      :param full_build: If True, the dictionary self.addl_info will also be
                         populated. It contains a lot of extra information about
                         concepts, but can consume a lot of memory. This is not
                         necessary for normal functioning of MedCAT
                         (Default value `False`).
      :type full_build: bool


   .. py:method:: reset_training()

      Will remove all training efforts - in other words all embeddings
      that are learnt for concepts in the current CDB. Please note that this
      does not remove synonyms (names) that were potentially added during
      supervised/online learning.


   .. py:method:: filter_by_cui(cuis_to_keep)

      Subset the core CDB fields (dictionaries/maps).

      Note that this will potentially keep a few more CUIs
      than are in cuis_to_keep. It will first find all names that
      link to the cuis_to_keep and then find all CUIs that
      link to those names and keep all of them.

      This also will not remove any data from cdb.addl_info -
      as this field can contain data of unknown structure.

      :param cuis_to_keep: CUIs that will be kept, the rest will be removed
                           (not completely, look above).
      :type cuis_to_keep: Collection[str]

      :raises Exception: If no snames and subsetting is not possible.


   .. py:method:: remove_cui(cui)

      Take a CUI and remove it from the CDB.

      This also removes the CUI from the name-specific per_cui_status
      maps, and removes all names that no longer correspond to
      any CUI after the removal of this one.

      :param cui: The CUI to remove.
      :type cui: str


   .. py:method:: _remove_names(cui, names)

      Remove names from an existing concept, so that these names will
      never again be used to link to this concept. This only removes the
      names from the linker (namely name2cuis and name2cuis2status); they
      will still be present everywhere else. Removing them everywhere
      would be bothersome, and it can be useful to keep the removed
      names in e.g. cui2names.

      :param cui: Concept ID or unique identifier in this database.
      :type cui: str
      :param names: Names to be removed (e.g. a list, a set, or even a dict, in
                    which case the keys will be used).
      :type names: Iterable[str]


   .. py:method:: __eq__(other)


   .. py:method:: get_cui2count_train()


   .. py:method:: get_name2count_train()


   .. py:method:: get_hash()


   .. py:method:: get_basic_info()


   .. py:method:: save(save_path, serialiser = AvailableSerialisers.dill, overwrite = False)

      Save CDB at path.

      :param save_path: The path to save at.
      :type save_path: str
      :param serialiser: The serialiser. Defaults to AvailableSerialisers.dill.
      :type serialiser: Union[ str, AvailableSerialisers], optional
      :param overwrite: Whether to allow overwriting existing files. Defaults to False.
      :type overwrite: bool, optional


   .. py:method:: load(path)
      :classmethod:


   .. py:method:: get_strategy()


   .. py:method:: ignore_attrs()
      :classmethod:


   .. py:method:: include_properties()
      :classmethod:

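The two-step expansion described for `filter_by_cui` (names of the kept CUIs first, then every CUI sharing one of those names) can be sketched on plain dictionaries. The function below is a simplified illustration, not the `CDB` method: it works on two hypothetical maps, `cui2names` and `name2cuis`, instead of the `cui2info` and `name2info` attributes listed above.

```python
def filter_by_cui(cui2names: dict[str, set[str]],
                  name2cuis: dict[str, set[str]],
                  cuis_to_keep: set[str]) -> tuple[dict, dict]:
    # Step 1: every name linked to a CUI we were asked to keep.
    names_to_keep: set[str] = set()
    for cui in cuis_to_keep:
        names_to_keep |= cui2names.get(cui, set())
    # Step 2: every CUI linked to one of those names (may add extra CUIs).
    all_cuis = set(cuis_to_keep)
    for name in names_to_keep:
        all_cuis |= name2cuis[name]
    new_cui2names = {cui: cui2names[cui]
                     for cui in all_cuis if cui in cui2names}
    new_name2cuis = {name: name2cuis[name] for name in names_to_keep}
    return new_cui2names, new_name2cuis


cui2names = {"C1": {"cold"}, "C2": {"cold", "chill"}, "C3": {"fever"}}
name2cuis = {"cold": {"C1", "C2"}, "chill": {"C2"}, "fever": {"C3"}}
kept_cuis, kept_names = filter_by_cui(cui2names, name2cuis, {"C1"})
```

Here only `C1` was requested, but `C2` is kept as well because it shares the name `cold`; this is the "a few more CUIs" effect noted in the docstring.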

.. py:data:: _EXTRA_NAME
   :value: 'dict-ner'


.. py:data:: logger

.. py:class:: NER(tokenizer, cdb)

   Bases: :py:obj:`medcat.components.types.AbstractCoreComponent`


   Base class for protocol classes.

   Protocol classes are defined as::

       class Proto(Protocol):
           def meth(self) -> int:
               ...

   Such classes are primarily used with static type checkers that recognize
   structural subtyping (static duck-typing), for example::

       class C:
           def meth(self) -> int:
               return 0

       def func(x: Proto) -> int:
           return x.meth()

       func(C())  # Passes static type check

   See PEP 544 for details. Protocol classes decorated with
   @typing.runtime_checkable act as simple-minded runtime protocols that check
   only the presence of given attributes, ignoring their type signatures.
   Protocol classes can be generic, they are defined as::

       class GenProto(Protocol[T]):
           def meth(self) -> T:
               ...


   .. py:attribute:: name
      :value: 'cat_dict_ner'


      The name of the component.


   .. py:method:: __init__(tokenizer, cdb)


   .. py:attribute:: tokenizer


   .. py:attribute:: cdb


   .. py:attribute:: config


   .. py:attribute:: automaton


   .. py:method:: _rebuild_automaton()


   .. py:method:: get_type()


   .. py:method:: __call__(doc)

      Detect candidate concepts; the linker then does the rest.
      Detected entities are added to doc.entities, and each entity
      can carry entity.link_candidates for the linker to resolve.

      :param doc: spaCy document to be annotated with named entities.
      :type doc: MutableDocument

      :Returns: **doc** (*MutableDocument*) -- spaCy document with detected entities.


   .. py:method:: get_init_args(tokenizer, cdb, vocab, model_load_path)
      :classmethod:


      Get the init arguments for the component.

      :param tokenizer: The tokenizer.
      :type tokenizer: BaseTokenizer
      :param cdb: The CDB.
      :type cdb: CDB
      :param vocab: The Vocab.
      :type vocab: Vocab
      :param model_load_path: The model load path (or None).
      :type model_load_path: Optional[str]

      :Returns: **list[Any]** -- The list of init arguments.


   .. py:method:: get_init_kwargs(tokenizer, cdb, vocab, model_load_path)
      :classmethod:


      Get init keyword arguments for the component.

      :param tokenizer: The tokenizer.
      :type tokenizer: BaseTokenizer
      :param cdb: The CDB.
      :type cdb: CDB
      :param vocab: The Vocab.
      :type vocab: Vocab
      :param model_load_path: The model load path (or None).
      :type model_load_path: Optional[str]

      :Returns: **dict[str, Any]** -- The keyword arguments.


   .. py:attribute:: NAME_PREFIX
      :value: 'core_'


   .. py:property:: full_name
      :type: str


      Name with the component type (e.g. ner, linking, meta).


   .. py:method:: is_core()

      Whether the component is a core component or not.

      :Returns: **bool** -- Whether this is a core component.


   .. py:attribute:: __slots__
      :value: ()


   .. py:attribute:: _is_protocol
      :value: True


   .. py:attribute:: _is_runtime_protocol
      :value: False


   .. py:method:: __init_subclass__(*args, **kwargs)
      :classmethod:


   .. py:method:: __class_getitem__(params)
      :classmethod:
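

Dictionary-based candidate detection, as performed by `NER.__call__`, can be approximated with a plain lookup over token spans. This sketch swaps in a brute-force n-gram search for the `automaton` attribute listed above, whose type and construction are not documented here; `find_candidates` and its `max_len` parameter are hypothetical.

```python
def find_candidates(tokens: list[str], names: set[str],
                    max_len: int = 3) -> list[tuple[int, int, str]]:
    """Return (start, end, name) token spans whose text is a known name."""
    found = []
    for start in range(len(tokens)):
        # Try every span that starts here, up to max_len tokens long.
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            candidate = " ".join(tokens[start:end]).lower()
            if candidate in names:
                found.append((start, end, candidate))
    return found


names = {"kidney", "kidney failure", "diabetes"}
tokens = "Chronic kidney failure and diabetes".split()
candidates = find_candidates(tokens, names)
```

Overlapping candidates such as `kidney` and `kidney failure` are both returned; in the pipeline described above, resolving between candidates is left to the linker.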