medcat.utils.legacy.helpers
===========================

.. py:module:: medcat.utils.legacy.helpers


Attributes
----------

.. autoapisummary::

   medcat.utils.legacy.helpers.logger


Classes
-------

.. autoapisummary::

   medcat.utils.legacy.helpers.CAT
   medcat.utils.legacy.helpers.CDB
   medcat.utils.legacy.helpers.NameDescriptor


Functions
---------

.. autoapisummary::

   medcat.utils.legacy.helpers.prepare_name
   medcat.utils.legacy.helpers.has_per_concept_subnames
   medcat.utils.legacy.helpers._fix_subnames
   medcat.utils.legacy.helpers.fix_old_style_cnf
   medcat.utils.legacy.helpers.fix_subnames


Module Contents
---------------

.. py:class:: CAT(cdb, vocab = None, config = None, model_load_path = None)

   Bases: :py:obj:`medcat.storage.serialisables.AbstractSerialisable`

   This is a collection of serialisable model parts.


   .. py:method:: __init__(cdb, vocab = None, config = None, model_load_path = None)


   .. py:attribute:: cdb


   .. py:attribute:: vocab
      :value: None


   .. py:attribute:: config
      :value: None


   .. py:attribute:: _trainer
      :type: Optional[medcat.trainer.Trainer]
      :value: None


   .. py:attribute:: _pipeline


   .. py:attribute:: usage_monitor


   .. py:method:: _recreate_pipe(model_load_path = None)


   .. py:method:: get_init_attrs()
      :classmethod:


   .. py:method:: ignore_attrs()
      :classmethod:


   .. py:method:: __call__(text)


   .. py:method:: _ensure_not_training()

      Method to ensure config is not set to train.

      `config.components.linking.train` should only be True while training
      and not during inference.
      This also corrects the setting if necessary.


   .. py:method:: get_entities(text: str, only_cui: Literal[False] = False) -> medcat.data.entities.Entities
                  get_entities(text: str, only_cui: Literal[True] = True) -> medcat.data.entities.OnlyCUIEntities
                  get_entities(text: str, only_cui: bool = False) -> Union[dict, medcat.data.entities.Entities, medcat.data.entities.OnlyCUIEntities]

      Get the entities recognised and linked within the provided text.

      This will run the text through the pipeline and annotate the
      recognised and linked entities.

      :param text: The text to use.
      :type text: str
      :param only_cui: Whether to only output the CUIs rather than the entire context. Defaults to False.
      :type only_cui: bool, optional

      :Returns: **Union[dict, Entities, OnlyCUIEntities]** -- The entities found and linked within the text.


   .. py:method:: _mp_worker_func(texts_and_indices)


   .. py:method:: _generate_batches_by_char_length(text_iter, batch_size_chars, only_cui)


   .. py:method:: _generate_batches(text_iter, batch_size, batch_size_chars, only_cui)


   .. py:method:: _generate_simple_batches(text_iter, batch_size, only_cui)


   .. py:method:: _mp_one_batch_per_process(executor, batch_iter, external_processes)


   .. py:method:: get_entities_multi_texts(texts, only_cui = False, n_process = 1, batch_size = -1, batch_size_chars = 1000000)

      Get entities from multiple texts (potentially in parallel).

      If `n_process` > 1, `n_process - 1` new processes will be created
      and data will be processed on those as well as the main process in
      parallel.

      :param texts: The input texts. Either an iterable of raw texts or one in the format of `(text_index, text)`.
      :type texts: Union[Iterable[str], Iterable[tuple[str, str]]]
      :param only_cui: Whether to only return CUIs rather than other information like start/end and annotated value. Defaults to False.
      :type only_cui: bool
      :param n_process: Number of processes to use. Defaults to 1.
      :type n_process: int
      :param batch_size: The number of texts to batch at a time. A batch of the specified size will be given to each worker process. Defaults to -1, in which case the character count will be used instead.
      :type batch_size: int
      :param batch_size_chars: The maximum number of characters to process in a batch. Each process will be given a batch of texts with a total number of characters not exceeding this value. Defaults to 1,000,000 characters. Set to -1 to disable.
      :type batch_size_chars: int

      :Yields: *Iterator[tuple[str, Union[dict, Entities, OnlyCUIEntities]]]* -- The results in the format of (text_index, entities).


   .. py:method:: _get_entity(ent, doc_tokens, cui)


   .. py:method:: get_addon_output(ent)

      Get the addon output for the entity.

      This includes a key-value pair for each addon that provides some.
      Sometimes same-type addons may combine their output under the same key.

      :param ent: The entity in question.
      :type ent: MutableEntity

      :raises ValueError: If unable to merge multiple addon output.

      :Returns: **dict[str, dict]** -- All the addon output.


   .. py:method:: _doc_to_out_entity(ent, doc_tokens, only_cui)


   .. py:method:: _doc_to_out(doc, only_cui, out_with_text = False)


   .. py:property:: trainer

      The trainer object.


   .. py:method:: save_model_pack(target_folder, pack_name = DEFAULT_PACK_NAME, serialiser_type = 'dill', make_archive = True, only_archive = False, add_hash_to_pack_name = True, change_description = None)

      Save model pack.

      The resulting model pack name will have the hash of the model pack
      in its name if (and only if) the default model pack name is used.

      :param target_folder: The folder to save the pack in.
      :type target_folder: str
      :param pack_name: The model pack name. Defaults to DEFAULT_PACK_NAME.
      :type pack_name: str, optional
      :param serialiser_type: The serialiser type. Defaults to 'dill'.
      :type serialiser_type: Union[str, AvailableSerialisers], optional
      :param make_archive: Whether to make the archive (.zip) file. Defaults to True.
      :type make_archive: bool
      :param only_archive: Whether to clear the non-compressed folder. Defaults to False.
      :type only_archive: bool
      :param add_hash_to_pack_name: Whether to add the hash to the pack name. This is only relevant if pack_name is specified. Defaults to True.
      :type add_hash_to_pack_name: bool
      :param change_description: If provided, this description will be added to the model description. Defaults to None.
      :type change_description: Optional[str]

      :Returns: **str** -- The final model pack path.


   .. py:method:: _get_hash()


   .. py:method:: _versioning(change_description)


   .. py:method:: attempt_unpack(zip_path)
      :classmethod:

      Attempt to unpack the zip to a folder and get the model pack path.

      If the folder already exists, no unpacking is done.

      :param zip_path: The ZIP path.
      :type zip_path: str

      :Returns: **str** -- The model pack path.


   .. py:method:: load_model_pack(model_pack_path)
      :classmethod:

      Load the model pack from file.

      :param model_pack_path: The model pack path.
      :type model_pack_path: str

      :raises ValueError: If the saved data does not represent a model pack.

      :Returns: **CAT** -- The loaded model pack.


   .. py:method:: load_cdb(model_pack_path)
      :classmethod:

      Load the concept database from the provided model pack path.

      :param model_pack_path: Path to the model pack, zip or dir.
      :type model_pack_path: str

      :Returns: **CDB** -- The loaded concept database.


   .. py:method:: get_model_card(as_dict: Literal[True]) -> medcat.data.model_card.ModelCard
                  get_model_card(as_dict: Literal[False]) -> str

      Get the model card as either a (nested) `dict` or a JSON string.

      :param as_dict: Whether to return as dict. Defaults to False.
      :type as_dict: bool

      :Returns: **Union[str, ModelCard]** -- The model card.


   .. py:method:: __eq__(other)


   .. py:method:: add_addon(addon)


   .. py:method:: get_strategy()


   .. py:method:: include_properties()
      :classmethod:


.. py:class:: CDB(config)

   Bases: :py:obj:`medcat.storage.serialisables.AbstractSerialisable`

   The abstract serialisable base class.

   This defines some common defaults.


   .. py:method:: __init__(config)


   .. py:attribute:: config


   .. py:attribute:: cui2info
      :type: dict[str, medcat.cdb.concepts.CUIInfo]


   .. py:attribute:: name2info
      :type: dict[str, medcat.cdb.concepts.NameInfo]


   .. py:attribute:: type_id2info
      :type: dict[str, medcat.cdb.concepts.TypeInfo]


   .. py:attribute:: token_counts
      :type: dict[str, int]


   .. py:attribute:: addl_info
      :type: dict[str, Any]


   .. py:attribute:: _subnames
      :type: set[str]


   .. py:attribute:: is_dirty
      :value: False


   .. py:attribute:: has_changed_names
      :value: False


   .. py:method:: get_init_attrs()
      :classmethod:


   .. py:method:: _reset_subnames()


   .. py:method:: has_subname(name)

      Whether the CDB has the specified subname.

      :param name: The subname to check.
      :type name: str

      :Returns: **bool** -- Whether the subname is present in this CDB.


   .. py:method:: get_name(cui)

      Returns the preferred name if it exists, otherwise the longest
      name assigned to the concept.

      :param cui: Concept ID or unique identifier in this database.
      :type cui: str

      :Returns: **str** -- The name of the concept.


   .. py:method:: weighted_average_function(step)

      Get the weighted average for step.

      :param step: The step.
      :type step: int

      :Returns: **float** -- The weighted average.


   .. py:method:: add_types(types)

      Add type info to CDB.

      :param types: The raw type info.
      :type types: Iterable[tuple[str, str]]


   .. py:method:: add_names(cui, names, name_status = ST.AUTOMATIC, full_build = False)

      Adds a name to an existing concept.

      :param cui: Concept ID or unique identifier in this database, all concepts that have the same CUI will be merged internally.
      :type cui: str
      :param names: Names for this concept, or the value that if found in free text can be linked to this concept.
                    Names is a dict like: `{name: {'tokens': tokens, 'snames': snames, 'raw_name': raw_name}, ...}`
                    Names should be generated by helper function 'medcat.preprocessing.cleaners.prepare_name'
      :type names: dict[str, NameDescriptor]
      :param name_status: One of `P`, `N`, `A`. Defaults to 'A'.
      :type name_status: str
      :param full_build: If True the dictionary self.addl_info will also be populated. It contains a lot of extra information about concepts, but can be very memory consuming. This is not necessary for normal functioning of MedCAT (Default value `False`).
      :type full_build: bool


   .. py:method:: _add_concept_names(cui, names, name_status)


   .. py:method:: _add_full_build(cui, names, ontologies, description, type_ids)


   .. py:method:: _add_concept(cui, names, ontologies, name_status, type_ids, description, full_build = False)

      Add a concept to internal Concept Database (CDB). Depending on what
      you are providing this will add a large number of properties for
      each concept.

      :param cui: Concept ID or unique identifier in this database, all concepts that have the same CUI will be merged internally.
      :type cui: str
      :param names: Names for this concept, or the value that if found in free text can be linked to this concept.
                    Names is a dict like: `{name: {'tokens': tokens, 'snames': snames, 'raw_name': raw_name}, ...}`
                    Names should be generated by helper function 'medcat.preprocessing.cleaners.prepare_name'
      :type names: dict[str, NameDescriptor]
      :param ontologies: Ontologies in which the concept exists (e.g. SNOMEDCT, HPO).
      :type ontologies: set[str]
      :param name_status: One of `P`, `N`, `A`.
      :type name_status: str
      :param type_ids: Semantic type identifier (have a look at TUIs in UMLS or SNOMED-CT).
      :type type_ids: set[str]
      :param description: Description of this concept.
      :type description: str
      :param full_build: If True the dictionary self.addl_info will also be populated. It contains a lot of extra information about concepts, but can be very memory consuming. This is not necessary for normal functioning of MedCAT (Default value `False`).
      :type full_build: bool


   .. py:method:: reset_training()

      Will remove all training efforts - in other words all embeddings
      that are learnt for concepts in the current CDB. Please note that
      this does not remove synonyms (names) that were potentially added
      during supervised/online learning.


   .. py:method:: filter_by_cui(cuis_to_keep)

      Subset the core CDB fields (dictionaries/maps).

      Note that this will potentially keep a few more CUIs than in
      cuis_to_keep. It will first find all names that link to the
      cuis_to_keep and then find all CUIs that link to those names and
      keep all of them.

      This also will not remove any data from cdb.addl_info - as this
      field can contain data of unknown structure.

      :param cuis_to_keep: CUIs that will be kept, the rest will be removed (not completely, look above).
      :type cuis_to_keep: Collection[str]

      :raises Exception: If no snames and subsetting is not possible.


   .. py:method:: remove_cui(cui)

      This function takes a CUI and removes it from the CDB.

      It also removes the CUI from name specific per_cui_status maps,
      and removes all the names that do not correspond to any CUIs
      after the removal of this one.

      :param cui: The CUI to remove.
      :type cui: str


   .. py:method:: _remove_names(cui, names)

      Remove names from an existing concept - effect is this name will
      never again be used to link to this concept. This will only remove
      the name from the linker (namely name2cuis and name2cuis2status),
      the name will still be present everywhere else. Why? Because it is
      bothersome to remove it from everywhere, but could also be useful
      to keep the removed names in e.g. cui2names.

      :param cui: Concept ID or unique identifier in this database.
      :type cui: str
      :param names: Names to be removed (e.g. a list, a set, or even a dict, in which case the keys will be used).
      :type names: Iterable[str]


   .. py:method:: __eq__(other)


   .. py:method:: get_cui2count_train()


   .. py:method:: get_name2count_train()


   .. py:method:: get_hash()


   .. py:method:: get_basic_info()


   .. py:method:: save(save_path, serialiser = AvailableSerialisers.dill, overwrite = False)

      Save CDB at path.

      :param save_path: The path to save at.
      :type save_path: str
      :param serialiser: The serialiser. Defaults to AvailableSerialisers.dill.
      :type serialiser: Union[str, AvailableSerialisers], optional
      :param overwrite: Whether to allow overwriting existing files. Defaults to False.
      :type overwrite: bool, optional


   .. py:method:: load(path)
      :classmethod:


   .. py:method:: get_strategy()


   .. py:method:: ignore_attrs()
      :classmethod:


   .. py:method:: include_properties()
      :classmethod:


.. py:function:: prepare_name(raw_name, nlp, names, configs)

   Generates different forms of a name. Will edit the provided `names`
   dictionary and add information generated from the `name`.

   :param nlp: The tokenizer.
   :type nlp: BaseTokenizer
   :param names: Dictionary of existing names for this concept in this row of a CSV. The new generated name versions and other required information will be added here.
   :type names: dict[str, NameDescriptor]
   :param configs: Applicable configs for medcat.
   :type configs: tuple[LGeneral, LPreprocessing, LCDBMaker]

   :Returns: **names** (*dict*) -- The updated dictionary of prepared names.


.. py:class:: NameDescriptor

   .. py:attribute:: tokens
      :type: list[str]


   .. py:attribute:: snames
      :type: set[str]


   .. py:attribute:: raw_name
      :type: str


   .. py:attribute:: is_upper
      :type: bool


.. py:data:: logger

.. py:function:: has_per_concept_subnames(cdb)

.. py:function:: _fix_subnames(cat)

.. py:function:: fix_old_style_cnf(data, remove = {'py/object', '__fields_set__', '__private_attribute_values__'}, take_from = 'py/state.__dict__')

.. py:function:: fix_subnames(cat)
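The ``batch_size_chars`` parameter of ``CAT.get_entities_multi_texts`` is documented as capping the total number of characters given to each process in one batch. The sketch below illustrates that idea only; it is not MedCAT's implementation. The helper name ``batch_by_char_length`` and its behaviour for a single text longer than the limit (it gets a batch of its own) are assumptions made for illustration.

```python
from typing import Iterable, Iterator


def batch_by_char_length(texts: Iterable[str],
                         max_chars: int) -> Iterator[list[str]]:
    """Group texts so each batch holds at most ``max_chars`` characters.

    Hypothetical helper, not part of MedCAT. Assumption: a single text
    longer than ``max_chars`` is emitted as a batch of its own.
    """
    batch: list[str] = []
    batch_chars = 0
    for text in texts:
        # Close the current batch if adding this text would exceed the cap.
        if batch and batch_chars + len(text) > max_chars:
            yield batch
            batch, batch_chars = [], 0
        batch.append(text)
        batch_chars += len(text)
    # Emit whatever is left over after the last text.
    if batch:
        yield batch


batches = list(batch_by_char_length(["aaaa", "bb", "cccccc", "d"], max_chars=6))
oversized = list(batch_by_char_length(["x" * 10], max_chars=6))
```

Because the helper is a generator, batches are produced lazily, which suits a large or streaming ``texts`` iterable.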