medcat.utils.legacy.conversion_all

Attributes

logger

Classes

CAT

This is a collection of serialisable model parts.

CoreComponentType

Generic enumeration.

AvailableSerialisers

Describes the available serialisers.

NoActionLinker

Base class for protocol classes.

Converter

Converts v1 models to v2 models.

Functions

get_cdb_from_old(old_path)

Get the v2 CDB from a v1 CDB path.

get_config_from_old(path)

Convert the saved v1 config into a v2 Config.

get_vocab_from_old(old_path)

Convert a v1 vocab file to a v2 Vocab.

get_meta_cat_from_old(old_path, tokenizer)

Convert a v1 MetaCAT folder to a v2 MetaCAT.

get_rel_cat_from_old(cdb, old_path, tokenizer)

Convert a v1 RelCAT folder to a v2 RelCAT.

get_trf_ner_from_old(old_path, tokenizer)

fix_subnames(cat)

unpack(model_zip_path, target_folder)

Unpack v1 model into target folder.

Module Contents

class medcat.utils.legacy.conversion_all.CAT(cdb, vocab=None, config=None, model_load_path=None)

Bases: medcat.storage.serialisables.AbstractSerialisable

This is a collection of serialisable model parts.

__init__(cdb, vocab=None, config=None, model_load_path=None)
Return type:

None

cdb
vocab = None
config = None
_trainer: medcat.trainer.Trainer | None = None
_pipeline
usage_monitor
_recreate_pipe(model_load_path=None)
Parameters:

model_load_path (Optional[str])

Return type:

medcat.pipeline.pipeline.Pipeline

classmethod get_init_attrs()
Return type:

list[str]

classmethod ignore_attrs()
Return type:

list[str]

__call__(text)
Parameters:

text (str)

Return type:

Optional[medcat.tokenizing.tokens.MutableDocument]

_ensure_not_training()

Method to ensure config is not set to train.

config.components.linking.train should only be True while training, not during inference. This method also corrects the setting if necessary.

Return type:

None
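A minimal sketch of the check described above, using a stand-in config object rather than medcat's real Config class. The function name `ensure_not_training` and the warning text are illustrative assumptions, not the library's implementation.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger(__name__)


def ensure_not_training(config) -> None:
    """Reset config.components.linking.train to False if it was left on."""
    linking = config.components.linking
    if linking.train:
        # Hypothetical message; the real method may log differently or not at all.
        logger.warning("linking.train was True outside of training; resetting it")
        linking.train = False


# Stand-in config: only the attribute path named above is modelled.
config = SimpleNamespace(
    components=SimpleNamespace(linking=SimpleNamespace(train=True)))
ensure_not_training(config)
```

After the call, `config.components.linking.train` is False, so inference does not update the linker.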

get_entities(text: str, only_cui: Literal[False] = False) medcat.data.entities.Entities
get_entities(text: str, only_cui: Literal[True] = True) medcat.data.entities.OnlyCUIEntities
get_entities(text: str, only_cui: bool = False) dict | medcat.data.entities.Entities | medcat.data.entities.OnlyCUIEntities

Get the entities recognised and linked within the provided text.

This will run the text through the pipeline and annotate the recognised and linked entities.

Parameters:
  • text (str) – The text to use.

  • only_cui (bool, optional) – Whether to only output the CUIs rather than the entire context. Defaults to False.

Returns:

Union[dict, Entities, OnlyCUIEntities] – The entities found and linked within the text.
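The three signatures listed for get_entities are typing overloads: the literal value of only_cui lets a static type checker pick the return type. The sketch below shows the pattern with stand-in classes; `Entities` and `OnlyCUIEntities` here are hypothetical placeholders for the medcat types, and the function body is a dummy.

```python
from typing import Literal, Union, overload


class Entities(dict):
    """Stand-in for medcat.data.entities.Entities (shape not modelled)."""


class OnlyCUIEntities(dict):
    """Stand-in for medcat.data.entities.OnlyCUIEntities (shape not modelled)."""


@overload
def get_entities(text: str, only_cui: Literal[False] = False) -> Entities: ...
@overload
def get_entities(text: str, only_cui: Literal[True] = True) -> OnlyCUIEntities: ...


def get_entities(text: str, only_cui: bool = False
                 ) -> Union[Entities, OnlyCUIEntities]:
    # Dummy body: a real implementation would run the pipeline on `text`.
    return OnlyCUIEntities() if only_cui else Entities()


full = get_entities("some text")                 # checked as Entities
cuis = get_entities("some text", only_cui=True)  # checked as OnlyCUIEntities
```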

_mp_worker_func(texts_and_indices)
Parameters:

texts_and_indices (list[tuple[str, str, bool]])

Return type:

list[tuple[str, str, Union[dict, medcat.data.entities.Entities, medcat.data.entities.OnlyCUIEntities]]]

_generate_batches_by_char_length(text_iter, batch_size_chars, only_cui)
Parameters:
  • text_iter (Union[Iterator[str], Iterator[tuple[str, str]]])

  • batch_size_chars (int)

  • only_cui (bool)

Return type:

Iterator[list[tuple[str, str, bool]]]

_generate_batches(text_iter, batch_size, batch_size_chars, only_cui)
Parameters:
  • text_iter (Union[Iterator[str], Iterator[tuple[str, str]]])

  • batch_size (int)

  • batch_size_chars (int)

  • only_cui (bool)

Return type:

Iterator[list[tuple[str, str, bool]]]

_generate_simple_batches(text_iter, batch_size, only_cui)
Parameters:
  • text_iter (Union[Iterator[str], Iterator[tuple[str, str]]])

  • batch_size (int)

  • only_cui (bool)

Return type:

Iterator[list[tuple[str, str, bool]]]

_mp_one_batch_per_process(executor, batch_iter, external_processes)
Parameters:
  • executor (concurrent.futures.ProcessPoolExecutor)

  • batch_iter (Iterator[list[tuple[str, str, bool]]])

  • external_processes (int)

Return type:

Iterator[tuple[str, Union[dict, medcat.data.entities.Entities, medcat.data.entities.OnlyCUIEntities]]]

get_entities_multi_texts(texts, only_cui=False, n_process=1, batch_size=-1, batch_size_chars=1000000)

Get entities from multiple texts (potentially in parallel).

If n_process > 1, n_process - 1 new processes will be created and data will be processed on those as well as the main process in parallel.

Parameters:
  • texts (Union[Iterable[str], Iterable[tuple[str, str]]]) – The input texts. Either an iterable of raw text or one in the format of (text_index, text).

  • only_cui (bool) – Whether to only return CUIs rather than other information like start/end and annotated value. Defaults to False.

  • n_process (int) – Number of processes to use. Defaults to 1.

  • batch_size (int) – The number of texts to batch at a time. A batch of the specified size will be given to each worker process. Defaults to -1 and in this case the character count will be used instead.

  • batch_size_chars (int) – The maximum number of characters to process in a batch. Each process will be given a batch of texts with a total number of characters not exceeding this value. Defaults to 1,000,000 characters. Set to -1 to disable.

Yields:

Iterator[tuple[str, Union[dict, Entities, OnlyCUIEntities]]] – The results in the format of (text_index, entities).

Return type:

Iterator[tuple[str, Union[dict, medcat.data.entities.Entities, medcat.data.entities.OnlyCUIEntities]]]
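The character-based batching described for batch_size_chars can be sketched with the standard library alone. This is an illustration of the idea, not medcat's code: the helper name `batch_by_chars`, the use of the position as the index for raw strings, and the handling of a single oversized text are all assumptions.

```python
from typing import Iterable, Iterator, Union


def batch_by_chars(
    texts: Iterable[Union[str, tuple[str, str]]],
    max_chars: int,
) -> Iterator[list[tuple[str, str]]]:
    """Yield batches of (text_index, text) whose total length is <= max_chars."""
    batch: list[tuple[str, str]] = []
    batch_chars = 0
    for num, item in enumerate(texts):
        # Assumption: raw strings are indexed by their position in the input.
        index, text = (str(num), item) if isinstance(item, str) else item
        if batch and batch_chars + len(text) > max_chars:
            yield batch
            batch, batch_chars = [], 0
        # Assumption: a single text longer than max_chars gets its own batch.
        batch.append((index, text))
        batch_chars += len(text)
    if batch:
        yield batch


batches = list(batch_by_chars(["aaaa", "bb", "ccc", ("doc-9", "d")], max_chars=6))
```

Here the first two texts fill the 6-character budget exactly, so the third text starts a new batch.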

_get_entity(ent, doc_tokens, cui)
Return type:

medcat.data.entities.Entity

get_addon_output(ent)

Get the addon output for the entity.

This includes a key-value pair for each addon that provides output. Sometimes same-type addons may combine their output under the same key.

Parameters:

ent (MutableEntity) – The entity in question.

Raises:

ValueError – If the outputs of multiple addons cannot be merged.

Returns:

dict[str, dict] – All the addon output.

Return type:

dict[str, dict]
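One way to picture the merging of same-key addon output is the sketch below. The merge rule is an assumption made for illustration (outputs sharing a key are combined, and a clash on a sub-key raises ValueError); the helper name `merge_addon_output` and the example keys are hypothetical, and the real method may merge differently.

```python
def merge_addon_output(outputs: list[tuple[str, dict]]) -> dict[str, dict]:
    """Combine (key, output) pairs, merging outputs that share a key."""
    merged: dict[str, dict] = {}
    for key, output in outputs:
        target = merged.setdefault(key, {})
        clashing = target.keys() & output.keys()
        if clashing:
            # Assumed failure mode: two addons report the same sub-key.
            raise ValueError(
                f"Unable to merge addon output for {key!r}: {sorted(clashing)}")
        target.update(output)
    return merged


# Hypothetical addon keys and values, for illustration only.
combined = merge_addon_output([
    ("meta_anns", {"Status": "Affirmed"}),
    ("meta_anns", {"Temporality": "Recent"}),
    ("relations", {"count": 0}),
])
```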

_doc_to_out_entity(ent, doc_tokens, only_cui)
Return type:

tuple[int, Union[medcat.data.entities.Entity, str]]

_doc_to_out(doc, only_cui, out_with_text=False)
Return type:

Union[medcat.data.entities.Entities, medcat.data.entities.OnlyCUIEntities]

property trainer

The trainer object.

save_model_pack(target_folder, pack_name=DEFAULT_PACK_NAME, serialiser_type='dill', make_archive=True, only_archive=False, add_hash_to_pack_name=True, change_description=None)

Save model pack.

The resulting model pack name will have the hash of the model pack in its name if (and only if) the default model pack name is used.

Parameters:
  • target_folder (str) – The folder to save the pack in.

  • pack_name (str, optional) – The model pack name. Defaults to DEFAULT_PACK_NAME.

  • serialiser_type (Union[str, AvailableSerialisers], optional) – The serialiser type. Defaults to ‘dill’.

  • make_archive (bool) – Whether to make the archive (.zip file). Defaults to True.

  • only_archive (bool) – Whether to clear the non-compressed folder. Defaults to False.

  • add_hash_to_pack_name (bool) – Whether to add the hash to the pack name. This is only relevant if pack_name is specified. Defaults to True.

  • change_description (Optional[str]) – If provided, this description will be added to the model description. Defaults to None.

Returns:

str – The final model pack path.

Return type:

str

_get_hash()
Return type:

str

_versioning(change_description)
Parameters:

change_description (Optional[str])

Return type:

str

classmethod attempt_unpack(zip_path)

Attempt to unpack the ZIP to a folder and get the model pack path.

If the folder already exists, no unpacking is done.

Parameters:

zip_path (str) – The ZIP path

Returns:

str – The model pack path

Return type:

str
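The "unpack only if the folder is missing" behaviour can be sketched with the standard library. This is not medcat's implementation; in particular, deriving the target folder by dropping the ZIP extension is an assumption.

```python
import os
import shutil
import tempfile
import zipfile


def attempt_unpack(zip_path: str) -> str:
    """Unpack zip_path next to itself unless the target folder already exists."""
    # Assumption: the target folder is the ZIP path without its extension.
    target, _ = os.path.splitext(zip_path)
    if not os.path.exists(target):
        shutil.unpack_archive(zip_path, target)
    return target


# Demo with a throwaway archive in a temporary directory.
work_dir = tempfile.mkdtemp()
demo_zip = os.path.join(work_dir, "model_pack.zip")
with zipfile.ZipFile(demo_zip, "w") as zf:
    zf.writestr("config.json", "{}")

first = attempt_unpack(demo_zip)   # unpacks the archive
second = attempt_unpack(demo_zip)  # folder exists, so nothing is unpacked
```

Skipping the second unpack keeps repeated loads of the same pack cheap, at the cost of not noticing a changed archive.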

classmethod load_model_pack(model_pack_path)

Load the model pack from file.

Parameters:

model_pack_path (str) – The model pack path.

Raises:

ValueError – If the saved data does not represent a model pack.

Returns:

CAT – The loaded model pack.

Return type:

CAT

classmethod load_cdb(model_pack_path)

Load the concept database from the provided model pack path.

Parameters:

model_pack_path (str) – The path to the model pack (ZIP or directory).

Returns:

CDB – The loaded concept database

Return type:

medcat.cdb.CDB

get_model_card(as_dict: Literal[True]) medcat.data.model_card.ModelCard
get_model_card(as_dict: Literal[False]) str

Get the model card as either a (nested) dict or a JSON string.

Parameters:

as_dict (bool) – Whether to return as dict. Defaults to False.

Returns:

Union[str, ModelCard] – The model card.

__eq__(other)
Parameters:

other (Any)

Return type:

bool

add_addon(addon)
Parameters:

addon (medcat.components.addons.addons.AddonComponent)

Return type:

None

get_strategy()
Return type:

SerialisingStrategy

classmethod include_properties()
Return type:

list[str]

class medcat.utils.legacy.conversion_all.CoreComponentType

Bases: enum.Enum

Generic enumeration.

Derive from this class to define new enumerations.

tagging
token_normalizing
ner
linking
__new__(value)
_generate_next_value_(start, count, last_values)

Generate the next value when not given.

  • name: the name of the member

  • start: the initial start value or None

  • count: the number of existing members

  • last_value: the last value assigned or None

classmethod _missing_(value)
__repr__()
__str__()
__dir__()

Returns all members and all public methods

__format__(format_spec)

Returns format using actual value type unless __str__ has been overridden.

__hash__()
__reduce_ex__(proto)
name()

The name of the Enum member.

value()

The value of the Enum member.

class medcat.utils.legacy.conversion_all.AvailableSerialisers

Bases: enum.Enum

Describes the available serialisers.

dill
json
write_to(file_path)
Parameters:

file_path (str)

Return type:

None

classmethod from_file(file_path)
Parameters:

file_path (str)

Return type:

AvailableSerialisers

__new__(value)
_generate_next_value_(start, count, last_values)

Generate the next value when not given.

  • name: the name of the member

  • start: the initial start value or None

  • count: the number of existing members

  • last_value: the last value assigned or None

classmethod _missing_(value)
__repr__()
__str__()
__dir__()

Returns all members and all public methods

__format__(format_spec)

Returns format using actual value type unless __str__ has been overridden.

__hash__()
__reduce_ex__(proto)
name()

The name of the Enum member.

value()

The value of the Enum member.

class medcat.utils.legacy.conversion_all.NoActionLinker

Bases: medcat.components.types.AbstractCoreComponent

Base class for protocol classes.

Protocol classes are defined as:

class Proto(Protocol):
    def meth(self) -> int:
        ...

Such classes are primarily used with static type checkers that recognize structural subtyping (static duck-typing), for example:

class C:
    def meth(self) -> int:
        return 0

def func(x: Proto) -> int:
    return x.meth()

func(C())  # Passes static type check

See PEP 544 for details. Protocol classes decorated with @typing.runtime_checkable act as simple-minded runtime protocols that check only the presence of given attributes, ignoring their type signatures. Protocol classes can be generic, they are defined as:

class GenProto(Protocol[T]):
    def meth(self) -> T:
        ...

name = 'no_action'

The name of the component.

get_type()
__call__(doc)
Parameters:

doc (medcat.tokenizing.tokens.MutableDocument)

Return type:

medcat.tokenizing.tokens.MutableDocument

classmethod create_new_component(cnf, tokenizer, cdb, vocab, model_load_path)

Create a new component or load one off disk if load path presented.

This may raise an exception if the wrong type of config is provided.

Parameters:
  • cnf (ComponentConfig) – The config relevant to this component.

  • tokenizer (BaseTokenizer) – The base tokenizer.

  • cdb (CDB) – The CDB.

  • vocab (Vocab) – The Vocab.

  • model_load_path (Optional[str]) – Model load path (if present).

Returns:

Self – The new component.

Return type:

NoActionLinker

NAME_PREFIX = 'core_'
property full_name: str

Name with the component type (e.g. ner, linking, meta).

Return type:

str

is_core()

Whether the component is a core component or not.

Returns:

bool – Whether this is a core component.

Return type:

bool

__slots__ = ()
_is_protocol = True
_is_runtime_protocol = False
classmethod __init_subclass__(*args, **kwargs)
classmethod __class_getitem__(params)
medcat.utils.legacy.conversion_all.get_cdb_from_old(old_path)

Get the v2 CDB from a v1 CDB path.

Parameters:

old_path (str) – The v1 CDB path.

Returns:

CDB – The v2 CDB.

Return type:

medcat.cdb.CDB

medcat.utils.legacy.conversion_all.get_config_from_old(path)

Convert the saved v1 config into a v2 Config.

Parameters:

path (str) – The v1 config path.

Returns:

Config – The v2 config.

Return type:

medcat.config.Config

medcat.utils.legacy.conversion_all.get_vocab_from_old(old_path)

Convert a v1 vocab file to a v2 Vocab.

Parameters:

old_path (str) – The v1 vocab file path.

Returns:

Vocab – The v2 Vocab.

Return type:

medcat.vocab.Vocab

medcat.utils.legacy.conversion_all.get_meta_cat_from_old(old_path, tokenizer)

Convert a v1 MetaCAT folder to a v2 MetaCAT.

Parameters:
  • old_path (str) – The v1 MetaCAT file path.

  • tokenizer (BaseTokenizer) – The tokenizer.

Returns:

MetaCATAddon – The v2 MetaCAT.

Return type:

medcat.components.addons.meta_cat.MetaCATAddon

medcat.utils.legacy.conversion_all.get_rel_cat_from_old(cdb, old_path, tokenizer)

Convert a v1 RelCAT folder to a v2 RelCAT.

Parameters:
  • cdb (CDB) – The base CDB.

  • old_path (str) – The v1 RelCAT file path.

  • tokenizer (BaseTokenizer) – The tokenizer.

Returns:

RelCATAddon – The v2 RelCAT.

Return type:

medcat.components.addons.relation_extraction.rel_cat.RelCATAddon

medcat.utils.legacy.conversion_all.get_trf_ner_from_old(old_path, tokenizer)
Return type:

medcat.components.ner.trf.transformers_ner.TransformersNER

medcat.utils.legacy.conversion_all.fix_subnames(cat)
Parameters:

cat (medcat.cat.CAT)

Return type:

None

medcat.utils.legacy.conversion_all.logger
class medcat.utils.legacy.conversion_all.Converter(medcat1_model_pack_path, new_model_pack_path, ser_type=AvailableSerialisers.dill)

Converts v1 models to v2 models.

cdb_name = 'cdb.dat'
vocab_name = 'vocab.dat'
config_name = 'config.json'
__init__(medcat1_model_pack_path, new_model_pack_path, ser_type=AvailableSerialisers.dill)
old_model_folder
new_model_folder
ser_type
property expected_files_in_folder

The base names of the required files in a folder for a v1 model.
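A sketch of how a folder could be checked against these base names, assuming the three file names listed in the class attributes above are exactly the required set. The helper `missing_v1_files` is hypothetical and not part of medcat.

```python
import os
import tempfile

# Base names taken from the cdb_name, vocab_name and config_name attributes.
EXPECTED_FILES = ("cdb.dat", "vocab.dat", "config.json")


def missing_v1_files(folder: str) -> list[str]:
    """Return the expected base names that are absent from folder."""
    present = set(os.listdir(folder))
    return [name for name in EXPECTED_FILES if name not in present]


# Demo folder that lacks the vocab file.
demo = tempfile.mkdtemp()
for name in ("cdb.dat", "config.json"):
    open(os.path.join(demo, name), "w").close()
missing = missing_v1_files(demo)
```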

_validate()
convert()

Use the gathered information to convert to a v2 model.

This converts the CDB, Vocab, and Config, in that order, and then creates the model pack.

If self.new_model_folder is set, the model will be saved as well.

Returns:

CAT – The model pack.

Return type:

medcat.cat.CAT
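A hedged usage sketch based only on the constructor and convert() signatures shown above. The wrapper `convert_v1_pack` is hypothetical and takes the converter class as an argument so that the snippet runs without medcat installed; the paths in the comment are placeholders.

```python
from typing import Any, Callable


def convert_v1_pack(old_pack: str, new_pack: str,
                    converter_factory: Callable[[str, str], Any]) -> Any:
    """Build a converter from the two paths and return the converted model."""
    converter = converter_factory(old_pack, new_pack)
    return converter.convert()


# With medcat installed, this could be called as (placeholder paths):
#   from medcat.utils.legacy.conversion_all import Converter
#   cat = convert_v1_pack("path/to/v1_model_pack", "path/to/v2_models", Converter)
```

The serialiser defaults to AvailableSerialisers.dill; passing ser_type explicitly to Converter would be the way to choose another one.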

medcat.utils.legacy.conversion_all.unpack(model_zip_path, target_folder)

Unpack v1 model into target folder.

Parameters:
  • model_zip_path (str) – ZIP path.

  • target_folder (str) – Target folder.