medcat.utils.legacy.conversion_all
==================================

.. py:module:: medcat.utils.legacy.conversion_all


Attributes
----------

.. autoapisummary::

   medcat.utils.legacy.conversion_all.logger


Classes
-------

.. autoapisummary::

   medcat.utils.legacy.conversion_all.CAT
   medcat.utils.legacy.conversion_all.CoreComponentType
   medcat.utils.legacy.conversion_all.AvailableSerialisers
   medcat.utils.legacy.conversion_all.NoActionLinker
   medcat.utils.legacy.conversion_all.Converter


Functions
---------

.. autoapisummary::

   medcat.utils.legacy.conversion_all.get_cdb_from_old
   medcat.utils.legacy.conversion_all.get_config_from_old
   medcat.utils.legacy.conversion_all.get_vocab_from_old
   medcat.utils.legacy.conversion_all.get_meta_cat_from_old
   medcat.utils.legacy.conversion_all.get_rel_cat_from_old
   medcat.utils.legacy.conversion_all.get_trf_ner_from_old
   medcat.utils.legacy.conversion_all.fix_subnames
   medcat.utils.legacy.conversion_all.unpack


Module Contents
---------------

.. py:class:: CAT(cdb, vocab = None, config = None, model_load_path = None)

   Bases: :py:obj:`medcat.storage.serialisables.AbstractSerialisable`


   This is a collection of serialisable model parts.


   .. py:method:: __init__(cdb, vocab = None, config = None, model_load_path = None)


   .. py:attribute:: cdb


   .. py:attribute:: vocab
      :value: None


   .. py:attribute:: config
      :value: None


   .. py:attribute:: _trainer
      :type: Optional[medcat.trainer.Trainer]
      :value: None


   .. py:attribute:: _pipeline


   .. py:attribute:: usage_monitor


   .. py:method:: _recreate_pipe(model_load_path = None)


   .. py:method:: get_init_attrs()
      :classmethod:


   .. py:method:: ignore_attrs()
      :classmethod:


   .. py:method:: __call__(text)


   .. py:method:: _ensure_not_training()

      Method to ensure config is not set to train.

      `config.components.linking.train` should only be True while training
      and not during inference.
      This also corrects the setting if necessary.


   .. py:method:: get_entities(text: str, only_cui: Literal[False] = False) -> medcat.data.entities.Entities
                  get_entities(text: str, only_cui: Literal[True] = True) -> medcat.data.entities.OnlyCUIEntities
                  get_entities(text: str, only_cui: bool = False) -> Union[dict, medcat.data.entities.Entities, medcat.data.entities.OnlyCUIEntities]

      Get the entities recognised and linked within the provided text.

      This will run the text through the pipeline and annotate the
      recognised and linked entities.

      :param text: The text to use.
      :type text: str
      :param only_cui: Whether to only output the CUIs rather than the entire context. Defaults to False.
      :type only_cui: bool, optional

      :Returns: **Union[dict, Entities, OnlyCUIEntities]** -- The entities found and linked within the text.


   .. py:method:: _mp_worker_func(texts_and_indices)


   .. py:method:: _generate_batches_by_char_length(text_iter, batch_size_chars, only_cui)


   .. py:method:: _generate_batches(text_iter, batch_size, batch_size_chars, only_cui)


   .. py:method:: _generate_simple_batches(text_iter, batch_size, only_cui)


   .. py:method:: _mp_one_batch_per_process(executor, batch_iter, external_processes)


   .. py:method:: get_entities_multi_texts(texts, only_cui = False, n_process = 1, batch_size = -1, batch_size_chars = 1000000)

      Get entities from multiple texts (potentially in parallel).

      If `n_process` > 1, `n_process - 1` new processes will be created
      and data will be processed on those as well as the main process in
      parallel.

      :param texts: The input text. Either an iterable of raw text or one in the format of `(text_index, text)`.
      :type texts: Union[Iterable[str], Iterable[tuple[str, str]]]
      :param only_cui: Whether to only return CUIs rather than other information like start/end and annotated value. Defaults to False.
      :type only_cui: bool
      :param n_process: Number of processes to use. Defaults to 1.
      :type n_process: int
      :param batch_size: The number of texts to batch at a time. A batch of the specified size will be given to each worker process. Defaults to -1 and in this case the character count will be used instead.
      :type batch_size: int
      :param batch_size_chars: The maximum number of characters to process in a batch. Each process will be given a batch of texts with a total number of characters not exceeding this value. Defaults to 1,000,000 characters. Set to -1 to disable.
      :type batch_size_chars: int

      :Yields: *Iterator[tuple[str, Union[dict, Entities, OnlyCUIEntities]]]* -- The results in the format of (text_index, entities).


   .. py:method:: _get_entity(ent, doc_tokens, cui)


   .. py:method:: get_addon_output(ent)

      Get the addon output for the entity.

      This includes a key-value pair for each addon that provides some.
      Sometimes same-type addons may combine their output under the same key.

      :param ent: The entity in question.
      :type ent: MutableEntity

      :raises ValueError: If unable to merge the output of multiple addons.

      :Returns: **dict[str, dict]** -- All the addon output.


   .. py:method:: _doc_to_out_entity(ent, doc_tokens, only_cui)


   .. py:method:: _doc_to_out(doc, only_cui, out_with_text = False)


   .. py:property:: trainer

      The trainer object.


   .. py:method:: save_model_pack(target_folder, pack_name = DEFAULT_PACK_NAME, serialiser_type = 'dill', make_archive = True, only_archive = False, add_hash_to_pack_name = True, change_description = None)

      Save model pack.

      The resulting model pack name will have the hash of the model pack
      in its name if (and only if) the default model pack name is used.

      :param target_folder: The folder to save the pack in.
      :type target_folder: str
      :param pack_name: The model pack name. Defaults to DEFAULT_PACK_NAME.
      :type pack_name: str, optional
      :param serialiser_type: The serialiser type. Defaults to 'dill'.
      :type serialiser_type: Union[str, AvailableSerialisers], optional
      :param make_archive: Whether to make the archive (.zip file). Defaults to True.
      :type make_archive: bool
      :param only_archive: Whether to clear the non-compressed folder. Defaults to False.
      :type only_archive: bool
      :param add_hash_to_pack_name: Whether to add the hash to the pack name. This is only relevant if pack_name is specified. Defaults to True.
      :type add_hash_to_pack_name: bool
      :param change_description: If provided, this description will be added to the model description. Defaults to None.
      :type change_description: Optional[str]

      :Returns: **str** -- The final model pack path.


   .. py:method:: _get_hash()


   .. py:method:: _versioning(change_description)


   .. py:method:: attempt_unpack(zip_path)
      :classmethod:

      Attempt to unpack the zip to a folder and get the model pack path.

      If the folder already exists, no unpacking is done.

      :param zip_path: The ZIP path.
      :type zip_path: str

      :Returns: **str** -- The model pack path.


   .. py:method:: load_model_pack(model_pack_path)
      :classmethod:

      Load the model pack from file.

      :param model_pack_path: The model pack path.
      :type model_pack_path: str

      :raises ValueError: If the saved data does not represent a model pack.

      :Returns: **CAT** -- The loaded model pack.


   .. py:method:: load_cdb(model_pack_path)
      :classmethod:

      Load the concept database from the provided model pack path.

      :param model_pack_path: Path to the model pack (zip or directory).
      :type model_pack_path: str

      :Returns: **CDB** -- The loaded concept database.


   .. py:method:: get_model_card(as_dict: Literal[True]) -> medcat.data.model_card.ModelCard
                  get_model_card(as_dict: Literal[False]) -> str

      Get the model card as either a (nested) `dict` or a JSON string.

      :param as_dict: Whether to return as dict. Defaults to False.
      :type as_dict: bool

      :Returns: **Union[str, ModelCard]** -- The model card.


   .. py:method:: __eq__(other)


   .. py:method:: add_addon(addon)


   .. py:method:: get_strategy()


   .. py:method:: include_properties()
      :classmethod:


.. py:class:: CoreComponentType

   Bases: :py:obj:`enum.Enum`


   Generic enumeration.

   Derive from this class to define new enumerations.


   .. py:attribute:: tagging


   .. py:attribute:: token_normalizing


   .. py:attribute:: ner


   .. py:attribute:: linking


   .. py:method:: __new__(value)


   .. py:method:: _generate_next_value_(start, count, last_values)

      Generate the next value when not given.

      name: the name of the member
      start: the initial start value or None
      count: the number of existing members
      last_value: the last value assigned or None


   .. py:method:: _missing_(value)
      :classmethod:


   .. py:method:: __repr__()


   .. py:method:: __str__()


   .. py:method:: __dir__()

      Returns all members and all public methods


   .. py:method:: __format__(format_spec)

      Returns format using actual value type unless __str__ has been overridden.


   .. py:method:: __hash__()


   .. py:method:: __reduce_ex__(proto)


   .. py:method:: name()

      The name of the Enum member.


   .. py:method:: value()

      The value of the Enum member.


.. py:class:: AvailableSerialisers

   Bases: :py:obj:`enum.Enum`


   Describes the available serialisers.


   .. py:attribute:: dill


   .. py:attribute:: json


   .. py:method:: write_to(file_path)


   .. py:method:: from_file(file_path)
      :classmethod:


   .. py:method:: __new__(value)


   .. py:method:: _generate_next_value_(start, count, last_values)

      Generate the next value when not given.

      name: the name of the member
      start: the initial start value or None
      count: the number of existing members
      last_value: the last value assigned or None


   .. py:method:: _missing_(value)
      :classmethod:


   .. py:method:: __repr__()


   .. py:method:: __str__()


   .. py:method:: __dir__()

      Returns all members and all public methods


   .. py:method:: __format__(format_spec)

      Returns format using actual value type unless __str__ has been overridden.


   .. py:method:: __hash__()


   .. py:method:: __reduce_ex__(proto)


   .. py:method:: name()

      The name of the Enum member.


   .. py:method:: value()

      The value of the Enum member.


.. py:class:: NoActionLinker

   Bases: :py:obj:`medcat.components.types.AbstractCoreComponent`


   Base class for protocol classes.

   Protocol classes are defined as::

       class Proto(Protocol):
           def meth(self) -> int:
               ...

   Such classes are primarily used with static type checkers that recognize
   structural subtyping (static duck-typing), for example::

       class C:
           def meth(self) -> int:
               return 0

       def func(x: Proto) -> int:
           return x.meth()

       func(C())  # Passes static type check

   See PEP 544 for details. Protocol classes decorated with
   @typing.runtime_checkable act as simple-minded runtime protocols that check
   only the presence of given attributes, ignoring their type signatures.
   Protocol classes can be generic, they are defined as::

       class GenProto(Protocol[T]):
           def meth(self) -> T:
               ...


   .. py:attribute:: name
      :value: 'no_action'


      The name of the component.


   .. py:method:: get_type()


   .. py:method:: __call__(doc)


   .. py:method:: create_new_component(cnf, tokenizer, cdb, vocab, model_load_path)
      :classmethod:

      Create a new component or load one off disk if a load path is provided.

      This may raise an exception if the wrong type of config is provided.

      :param cnf: The config relevant to this component.
      :type cnf: ComponentConfig
      :param tokenizer: The base tokenizer.
      :type tokenizer: BaseTokenizer
      :param cdb: The CDB.
      :type cdb: CDB
      :param vocab: The Vocab.
      :type vocab: Vocab
      :param model_load_path: Model load path (if present).
      :type model_load_path: Optional[str]

      :Returns: **Self** -- The new component.


   .. py:attribute:: NAME_PREFIX
      :value: 'core_'


   .. py:property:: full_name
      :type: str

      Name with the component type (e.g. ner, linking, meta).


   .. py:method:: is_core()

      Whether the component is a core component or not.

      :Returns: **bool** -- Whether this is a core component.


   .. py:attribute:: __slots__
      :value: ()


   .. py:attribute:: _is_protocol
      :value: True


   .. py:attribute:: _is_runtime_protocol
      :value: False


   .. py:method:: __init_subclass__(*args, **kwargs)
      :classmethod:


   .. py:method:: __class_getitem__(params)
      :classmethod:


.. py:function:: get_cdb_from_old(old_path)

   Get the v2 CDB from a v1 CDB path.

   :param old_path: The v1 CDB path.
   :type old_path: str

   :Returns: **CDB** -- The v2 CDB.


.. py:function:: get_config_from_old(path)

   Convert the saved v1 config into a v2 Config.

   :param path: The v1 config path.
   :type path: str

   :Returns: **Config** -- The v2 config.


.. py:function:: get_vocab_from_old(old_path)

   Convert a v1 vocab file to a v2 Vocab.

   :param old_path: The v1 vocab file path.
   :type old_path: str

   :Returns: **Vocab** -- The v2 Vocab.


.. py:function:: get_meta_cat_from_old(old_path, tokenizer)

   Convert a v1 MetaCAT folder to a v2 MetaCAT.

   :param old_path: The v1 MetaCAT file path.
   :type old_path: str
   :param tokenizer: The tokenizer.
   :type tokenizer: BaseTokenizer

   :Returns: **MetaCATAddon** -- The v2 MetaCAT.


.. py:function:: get_rel_cat_from_old(cdb, old_path, tokenizer)

   Convert a v1 RelCAT folder to a v2 RelCAT.

   :param cdb: The base CDB.
   :type cdb: CDB
   :param old_path: The v1 RelCAT file path.
   :type old_path: str
   :param tokenizer: The tokenizer.
   :type tokenizer: BaseTokenizer

   :Returns: **RelCATAddon** -- The v2 RelCAT.


.. py:function:: get_trf_ner_from_old(old_path, tokenizer)


.. py:function:: fix_subnames(cat)


.. py:data:: logger


.. py:class:: Converter(medcat1_model_pack_path, new_model_pack_path, ser_type = AvailableSerialisers.dill)

   Converts v1 models to v2 models.


   .. py:attribute:: cdb_name
      :value: 'cdb.dat'


   .. py:attribute:: vocab_name
      :value: 'vocab.dat'


   .. py:attribute:: config_name
      :value: 'config.json'


   .. py:method:: __init__(medcat1_model_pack_path, new_model_pack_path, ser_type = AvailableSerialisers.dill)


   .. py:attribute:: old_model_folder


   .. py:attribute:: new_model_folder


   .. py:attribute:: ser_type


   .. py:property:: expected_files_in_folder

      The base names of the required files in a folder for a v1 model.


   .. py:method:: _validate()


   .. py:method:: convert()

      Use the gathered information to convert to a v2 model.

      This converts the CDB, Vocab, and Config, in that order, and then
      creates the model pack.
      If `self.new_model_folder` is set, the model will be saved as well.

      :Returns: **CAT** -- The model pack.


.. py:function:: unpack(model_zip_path, target_folder)

   Unpack v1 model into target folder.

   :param model_zip_path: ZIP path.
   :type model_zip_path: str
   :param target_folder: Target folder.
   :type target_folder: str
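
The ``Converter`` attributes above name three files (``cdb.dat``, ``vocab.dat``, ``config.json``) as the base names a v1 model folder is expected to contain. As a rough illustration of what such a pre-conversion check could look like, here is a minimal standard-library sketch. It does not call MedCAT and is not the library's ``_validate()`` implementation, whose behaviour is not documented here; ``EXPECTED_V1_FILES``, ``missing_v1_files`` and ``demo`` are hypothetical names introduced only for this example.

```python
import os
import tempfile

# Base names taken from the documented Converter attributes
# (cdb_name, vocab_name, config_name). The constant itself is hypothetical.
EXPECTED_V1_FILES = ("cdb.dat", "vocab.dat", "config.json")


def missing_v1_files(folder: str) -> list[str]:
    """Hypothetical helper: return the expected v1 base names absent from `folder`."""
    present = set(os.listdir(folder))
    return [name for name in EXPECTED_V1_FILES if name not in present]


def demo() -> list[str]:
    """Build a throwaway folder that lacks vocab.dat and report what is missing."""
    with tempfile.TemporaryDirectory() as folder:
        for name in ("cdb.dat", "config.json"):
            # Create empty placeholder files; real v1 files hold model data.
            open(os.path.join(folder, name), "wb").close()
        return missing_v1_files(folder)


if __name__ == "__main__":
    print(demo())
```

An empty result would suggest the folder has the files listed above; whether the real converter requires anything further is not something this sketch can tell you.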