medcat.utils.legacy.helpers
===========================

.. py:module:: medcat.utils.legacy.helpers


Attributes
----------

.. autoapisummary::

   medcat.utils.legacy.helpers.logger


Classes
-------

.. autoapisummary::

   medcat.utils.legacy.helpers.CAT
   medcat.utils.legacy.helpers.CDB
   medcat.utils.legacy.helpers.NameDescriptor


Functions
---------

.. autoapisummary::

   medcat.utils.legacy.helpers.prepare_name
   medcat.utils.legacy.helpers.has_per_concept_subnames
   medcat.utils.legacy.helpers._fix_subnames
   medcat.utils.legacy.helpers.fix_old_style_cnf
   medcat.utils.legacy.helpers.fix_subnames


Module Contents
---------------

.. py:class:: CAT(cdb, vocab = None, config = None, model_load_path = None)

   Bases: :py:obj:`medcat.storage.serialisables.AbstractSerialisable`


   This is a collection of serialisable model parts.


   .. py:method:: __init__(cdb, vocab = None, config = None, model_load_path = None)


   .. py:attribute:: cdb


   .. py:attribute:: vocab
      :value: None


   .. py:attribute:: config
      :value: None


   .. py:attribute:: _trainer
      :type:  Optional[medcat.trainer.Trainer]
      :value: None


   .. py:attribute:: _pipeline


   .. py:attribute:: usage_monitor


   .. py:method:: _recreate_pipe(model_load_path = None)


   .. py:method:: get_init_attrs()
      :classmethod:


   .. py:method:: ignore_attrs()
      :classmethod:


   .. py:method:: __call__(text)


   .. py:method:: _ensure_not_training()

      Method to ensure config is not set to train.

      `config.components.linking.train` should only be True while training
      and not during inference.
      This also corrects the setting if necessary.


   .. py:method:: get_entities(text: str, only_cui: Literal[False] = False) -> medcat.data.entities.Entities
                  get_entities(text: str, only_cui: Literal[True] = True) -> medcat.data.entities.OnlyCUIEntities
                  get_entities(text: str, only_cui: bool = False) -> Union[dict, medcat.data.entities.Entities, medcat.data.entities.OnlyCUIEntities]

      Get the entities recognised and linked within the provided text.

      This will run the text through the pipeline and annotate the
      recognised and linked entities.

      :param text: The text to use.
      :type text: str
      :param only_cui: Whether to only output the CUIs
                       rather than the entire context. Defaults to False.
      :type only_cui: bool, optional

      :Returns: **Union[dict, Entities, OnlyCUIEntities]** -- The entities found and
                linked within the text.


   .. py:method:: _mp_worker_func(texts_and_indices)


   .. py:method:: _generate_batches_by_char_length(text_iter, batch_size_chars, only_cui)


   .. py:method:: _generate_batches(text_iter, batch_size, batch_size_chars, only_cui)


   .. py:method:: _generate_simple_batches(text_iter, batch_size, only_cui)


   .. py:method:: _mp_one_batch_per_process(executor, batch_iter, external_processes)


   .. py:method:: get_entities_multi_texts(texts, only_cui = False, n_process = 1, batch_size = -1, batch_size_chars = 1000000)

      Get entities from multiple texts (potentially in parallel).

      If `n_process` > 1, `n_process - 1` new processes will be created
      and data will be processed on those as well as the main process in
      parallel.

      :param texts: The input texts. Either an iterable of raw text or one
                    in the format of `(text_index, text)`.
      :type texts: Union[Iterable[str], Iterable[tuple[str, str]]]
      :param only_cui: Whether to only return CUIs rather than other information
                       like start/end and annotated value. Defaults to False.
      :type only_cui: bool
      :param n_process: Number of processes to use. Defaults to 1.
      :type n_process: int
      :param batch_size: The number of texts to batch at a time. A batch of the
                         specified size will be given to each worker process.
                         Defaults to -1 and in this case the character count will
                         be used instead.
      :type batch_size: int
      :param batch_size_chars: The maximum number of characters to process in a batch.
                               Each process will be given a batch of texts with a total
                               number of characters not exceeding this value. Defaults
                               to 1,000,000 characters. Set to -1 to disable.
      :type batch_size_chars: int

      :Yields: *Iterator[tuple[str, Union[dict, Entities, OnlyCUIEntities]]]* -- The results in the format of `(text_index, entities)`.


   .. py:method:: _get_entity(ent, doc_tokens, cui)


   .. py:method:: get_addon_output(ent)

      Get the addon output for the entity.

      This includes a key-value pair for each addon that provides some.
      Sometimes same-type addons may combine their output under the same key.

      :param ent: The entity in question.
      :type ent: MutableEntity

      :raises ValueError: If unable to merge multiple addon output.

      :Returns: **dict[str, dict]** -- All the addon output.


   .. py:method:: _doc_to_out_entity(ent, doc_tokens, only_cui)


   .. py:method:: _doc_to_out(doc, only_cui, out_with_text = False)


   .. py:property:: trainer

      The trainer object.


   .. py:method:: save_model_pack(target_folder, pack_name = DEFAULT_PACK_NAME, serialiser_type = 'dill', make_archive = True, only_archive = False, add_hash_to_pack_name = True, change_description = None)

      Save model pack.

      The resulting model pack name will have the hash of the model pack
      in its name if (and only if) the default model pack name is used.

      :param target_folder: The folder to save the pack in.
      :type target_folder: str
      :param pack_name: The model pack name.
                        Defaults to DEFAULT_PACK_NAME.
      :type pack_name: str, optional
      :param serialiser_type: The serialiser type. Defaults to 'dill'.
      :type serialiser_type: Union[str, AvailableSerialisers], optional
      :param make_archive: Whether to make the archive (.zip file). Defaults to True.
      :type make_archive: bool
      :param only_archive: Whether to clear the non-compressed folder. Defaults to False.
      :type only_archive: bool
      :param add_hash_to_pack_name: Whether to add the hash to the pack name. This is only relevant
                                    if pack_name is specified. Defaults to True.
      :type add_hash_to_pack_name: bool
      :param change_description: If provided, this description will be added to the
                                 model description. Defaults to None.
      :type change_description: Optional[str]

      :Returns: **str** -- The final model pack path.


   .. py:method:: _get_hash()


   .. py:method:: _versioning(change_description)


   .. py:method:: attempt_unpack(zip_path)
      :classmethod:


      Attempt to unpack the zip to a folder and get the model pack path.

      If the folder already exists, no unpacking is done.

      :param zip_path: The ZIP path
      :type zip_path: str

      :Returns: **str** -- The model pack path


   .. py:method:: load_model_pack(model_pack_path)
      :classmethod:


      Load the model pack from file.

      :param model_pack_path: The model pack path.
      :type model_pack_path: str

      :raises ValueError: If the saved data does not represent a model pack.

      :Returns: **CAT** -- The loaded model pack.


   .. py:method:: load_cdb(model_pack_path)
      :classmethod:


      Loads the concept database from the provided model pack path.

      :param model_pack_path: path to model pack, zip or dir.
      :type model_pack_path: str

      :Returns: **CDB** -- The loaded concept database


   .. py:method:: get_model_card(as_dict: Literal[True]) -> medcat.data.model_card.ModelCard
                  get_model_card(as_dict: Literal[False]) -> str

      Get the model card as either a (nested) `dict` or a JSON string.

      :param as_dict: Whether to return as dict. Defaults to False.
      :type as_dict: bool

      :Returns: **Union[str, ModelCard]** -- The model card.


   .. py:method:: __eq__(other)


   .. py:method:: add_addon(addon)


   .. py:method:: get_strategy()


   .. py:method:: include_properties()
      :classmethod:
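
The character-based batching that ``get_entities_multi_texts`` describes for ``batch_size_chars`` can be pictured with the minimal sketch below. It is not MedCAT's implementation: ``batch_by_char_length`` is a hypothetical helper, and the handling of a single text longer than the limit (it is placed in a batch of its own) is an assumption.

```python
from typing import Iterable, Iterator


def batch_by_char_length(
    texts: Iterable[tuple[str, str]], max_chars: int
) -> Iterator[list[tuple[str, str]]]:
    """Group (text_index, text) pairs so each batch stays within max_chars.

    Hypothetical helper for illustration only; not part of MedCAT.
    """
    batch: list[tuple[str, str]] = []
    batch_chars = 0
    for index, text in texts:
        # Close the current batch if adding this text would exceed the limit.
        if batch and batch_chars + len(text) > max_chars:
            yield batch
            batch, batch_chars = [], 0
        # Assumption: an oversized text still gets a batch of its own.
        batch.append((index, text))
        batch_chars += len(text)
    if batch:
        # Emit whatever is left once the input is exhausted.
        yield batch


docs = [("a", "x" * 6), ("b", "y" * 5), ("c", "z" * 4), ("d", "w" * 20)]
batches = list(batch_by_char_length(docs, max_chars=10))
batch_ids = [[index for index, _ in batch] for batch in batches]
```

With a limit of 10 characters, the texts above end up grouped as ``a``, then ``b`` and ``c`` together, then the oversized ``d`` alone.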


.. py:class:: CDB(config)

   Bases: :py:obj:`medcat.storage.serialisables.AbstractSerialisable`


   The abstract serialisable base class.

   This defines some common defaults.


   .. py:method:: __init__(config)


   .. py:attribute:: config


   .. py:attribute:: cui2info
      :type:  dict[str, medcat.cdb.concepts.CUIInfo]


   .. py:attribute:: name2info
      :type:  dict[str, medcat.cdb.concepts.NameInfo]


   .. py:attribute:: type_id2info
      :type:  dict[str, medcat.cdb.concepts.TypeInfo]


   .. py:attribute:: token_counts
      :type:  dict[str, int]


   .. py:attribute:: addl_info
      :type:  dict[str, Any]


   .. py:attribute:: _subnames
      :type:  set[str]


   .. py:attribute:: is_dirty
      :value: False


   .. py:attribute:: has_changed_names
      :value: False


   .. py:method:: get_init_attrs()
      :classmethod:


   .. py:method:: _reset_subnames()


   .. py:method:: has_subname(name)

      Whether the CDB has the specified subname.

      :param name: The subname to check.
      :type name: str

      :Returns: **bool** -- Whether the subname is present in this CDB.


   .. py:method:: get_name(cui)

      Returns preferred name if it exists, otherwise it will return
      the longest name assigned to the concept.

      :param cui: Concept ID or unique identifier in this database.
      :type cui: str

      :Returns: **str** -- The name of the concept.


   .. py:method:: weighted_average_function(step)

      Get the weighted average for the step.

      :param step: The step.
      :type step: int

      :Returns: **float** -- The weighted average.


   .. py:method:: add_types(types)

      Add type info to CDB.

      :param types: The raw type info.
      :type types: Iterable[tuple[str, str]]


   .. py:method:: add_names(cui, names, name_status = ST.AUTOMATIC, full_build = False)

      Adds a name to an existing concept.

      :param cui: Concept ID or unique identifier in this database, all concepts
                  that have the same CUI will be merged internally.
      :type cui: str
      :param names: Names for this concept, or the value that if found in free
                    text can be linked to this concept. Names is a dict like:
                    `{name: {'tokens': tokens, 'snames': snames,
                             'raw_name': raw_name}, ...}`
                    Names should be generated by helper function
                    'medcat.preprocessing.cleaners.prepare_name'
      :type names: dict[str, NameDescriptor]
      :param name_status: One of `P`, `N`, `A`. Defaults to 'A'.
      :type name_status: str
      :param full_build: If True the dictionary self.addl_info will also be populated,
                         contains a lot of extra information about concepts, but can be
                         very memory consuming. This is not necessary for normal
                         functioning of MedCAT (Default value `False`).
      :type full_build: bool


   .. py:method:: _add_concept_names(cui, names, name_status)


   .. py:method:: _add_full_build(cui, names, ontologies, description, type_ids)


   .. py:method:: _add_concept(cui, names, ontologies, name_status, type_ids, description, full_build = False)

      Add a concept to internal Concept Database (CDB). Depending on what
      you are providing this will add a large number of properties for each
      concept.

      :param cui: Concept ID or unique identifier in this database, all concepts
                  that have the same CUI will be merged internally.
      :type cui: str
      :param names: Names for this concept, or the value that if found in free
                    text can be linked to this concept. Names is a dict like:
                    `{name: {'tokens': tokens, 'snames': snames,
                             'raw_name': raw_name}, ...}`
                    Names should be generated by helper function
                    'medcat.preprocessing.cleaners.prepare_name'
      :type names: dict[str, NameDescriptor]
      :param ontologies: ontologies in which the concept exists (e.g. SNOMEDCT, HPO)
      :type ontologies: set[str]
      :param name_status: One of `P`, `N`, `A`
      :type name_status: str
      :param type_ids: Semantic type identifier (have a look at TUIs in UMLS or
                       SNOMED-CT)
      :type type_ids: set[str]
      :param description: Description of this concept.
      :type description: str
      :param full_build: If True the dictionary self.addl_info will also be populated,
                         contains a lot of extra information about concepts, but can be
                         very memory consuming. This is not necessary for normal
                         functioning of MedCAT (Default Value `False`).
      :type full_build: bool


   .. py:method:: reset_training()

      Will remove all training efforts - in other words all embeddings
      that are learnt for concepts in the current CDB. Please note that this
      does not remove synonyms (names) that were potentially added during
      supervised/online learning.


   .. py:method:: filter_by_cui(cuis_to_keep)

      Subset the core CDB fields (dictionaries/maps).

      Note that this will potentially keep a few more CUIs
      than those in cuis_to_keep. It will first find all names that
      link to the cuis_to_keep and then find all CUIs that
      link to those names and keep all of them.

      This also will not remove any data from cdb.addl_info -
      as this field can contain data of unknown structure.

      :param cuis_to_keep: CUIs that will be kept, the rest will be removed
                           (not completely, look above).
      :type cuis_to_keep: Collection[str]

      :raises Exception: If no snames and subsetting is not possible.


   .. py:method:: remove_cui(cui)

      This function takes a CUI and removes it from the CDB.

      It also removes the CUI from the name-specific per_cui_status
      maps and removes all the names that no longer correspond to
      any CUI after this one is removed.

      :param cui: The CUI to remove.
      :type cui: str


   .. py:method:: _remove_names(cui, names)

      Remove names from an existing concept - effect is this name will
      never again be used to link to this concept. This will only remove the
      name from the linker (namely name2cuis and name2cuis2status), the name
      will still be present everywhere else. Why? Because it is bothersome
      to remove it from everywhere, but could also be useful to keep the
      removed names in e.g. cui2names.

      :param cui: Concept ID or unique identifier in this database.
      :type cui: str
      :param names: Names to be removed, e.g. a list, a set, or even a dict (in which
                    case the keys will be used).
      :type names: Iterable[str]


   .. py:method:: __eq__(other)


   .. py:method:: get_cui2count_train()


   .. py:method:: get_name2count_train()


   .. py:method:: get_hash()


   .. py:method:: get_basic_info()


   .. py:method:: save(save_path, serialiser = AvailableSerialisers.dill, overwrite = False)

      Save CDB at path.

      :param save_path: The path to save at.
      :type save_path: str
      :param serialiser: The serialiser. Defaults to AvailableSerialisers.dill.
      :type serialiser: Union[str, AvailableSerialisers], optional
      :param overwrite: Whether to allow overwriting existing files. Defaults to False.
      :type overwrite: bool, optional


   .. py:method:: load(path)
      :classmethod:


   .. py:method:: get_strategy()


   .. py:method:: ignore_attrs()
      :classmethod:


   .. py:method:: include_properties()
      :classmethod:
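
The note in ``filter_by_cui`` about keeping more CUIs than requested can be illustrated with plain dictionaries. This is a sketch under assumptions: ``expand_cuis_to_keep``, ``cui2names`` and ``name2cuis`` are hypothetical stand-ins and do not reflect the actual structure of ``cui2info`` and ``name2info``.

```python
def expand_cuis_to_keep(
    cuis_to_keep: set[str],
    cui2names: dict[str, set[str]],
    name2cuis: dict[str, set[str]],
) -> tuple[set[str], set[str]]:
    """Return the names and CUIs that survive the subsetting.

    Hypothetical stand-in for illustration; not MedCAT's implementation.
    """
    # Step 1: collect every name linked to a requested CUI.
    names_to_keep: set[str] = set()
    for cui in cuis_to_keep:
        names_to_keep |= cui2names.get(cui, set())
    # Step 2: keep every CUI that any of those names links to.
    all_cuis: set[str] = set(cuis_to_keep)
    for name in names_to_keep:
        all_cuis |= name2cuis.get(name, set())
    return names_to_keep, all_cuis


cui2names = {"C1": {"cold"}, "C2": {"cold", "chill"}, "C3": {"fever"}}
name2cuis = {"cold": {"C1", "C2"}, "chill": {"C2"}, "fever": {"C3"}}
kept_names, kept_cuis = expand_cuis_to_keep({"C1"}, cui2names, name2cuis)
```

Requesting only ``C1`` also keeps ``C2``, because both share the name ``cold``, while ``C3`` is dropped.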


.. py:function:: prepare_name(raw_name, nlp, names, configs)

   Generates different forms of a name. Will edit the provided `names`
   dictionary and add information generated from the `raw_name`.

   :param nlp: The tokenizer.
   :type nlp: BaseTokenizer
   :param names: Dictionary of existing names for this concept in this row of a CSV.
                 The new generated name versions and other required information will
                 be added here.
   :type names: dict[str, NameDescriptor]
   :param configs: Applicable configs for medcat.
   :type configs: tuple[LGeneral, LPreprocessing, LCDBMaker]

   :Returns: **names** (*dict*) -- The updated dictionary of prepared names.


.. py:class:: NameDescriptor

   .. py:attribute:: tokens
      :type:  list[str]


   .. py:attribute:: snames
      :type:  set[str]


   .. py:attribute:: raw_name
      :type:  str


   .. py:attribute:: is_upper
      :type:  bool


.. py:data:: logger

.. py:function:: has_per_concept_subnames(cdb)

.. py:function:: _fix_subnames(cat)

.. py:function:: fix_old_style_cnf(data, remove = {'py/object', '__fields_set__', '__private_attribute_values__'}, take_from = 'py/state.__dict__')

.. py:function:: fix_subnames(cat)