medcat2.utils.legacy.conversion_all

Attributes

logger

Classes

CAT

This is a collection of serialisable model parts.

CoreComponentType

Generic enumeration.

AvailableSerialisers

Describes the available serialisers.

NoActionLinker

Base class for protocol classes.

Converter

Converts v1 models to v2 models.

Functions

get_cdb_from_old(old_path)

Get the v2 CDB from a v1 CDB path.

get_config_from_old(path)

Convert the saved v1 config into a v2 Config.

get_vocab_from_old(old_path)

Convert a v1 vocab file to a v2 Vocab.

get_meta_cat_from_old(old_path, tokenizer)

Convert a v1 MetaCAT folder to a v2 MetaCAT.

get_rel_cat_from_old(cdb, old_path, tokenizer)

Convert a v1 RelCAT folder to a v2 RelCAT.

get_trf_ner_from_old(old_path, tokenizer)

fix_subnames(cat)

unpack(model_zip_path, target_folder)

Unpack v1 model into target folder.

Module Contents

class medcat2.utils.legacy.conversion_all.CAT(cdb, vocab=None, config=None, model_load_path=None)

Bases: medcat2.storage.serialisables.AbstractSerialisable

This is a collection of serialisable model parts.

__init__(cdb, vocab=None, config=None, model_load_path=None)
Return type:

None

cdb
vocab = None
config = None
_trainer: medcat2.trainer.Trainer | None = None
_pipeline
usage_monitor
_recreate_pipe(model_load_path=None)
Parameters:

model_load_path (Optional[str])

Return type:

medcat2.pipeline.pipeline.Pipeline

classmethod get_init_attrs()
Return type:

list[str]

classmethod ignore_attrs()
Return type:

list[str]

__call__(text)
Parameters:

text (str)

Return type:

Optional[medcat2.tokenizing.tokens.MutableDocument]

_ensure_not_training()

Method to ensure config is not set to train.

config.components.linking.train should only be True while training, not during inference. This method also corrects the setting if necessary.

Return type:

None
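
A minimal sketch of the behaviour described above, assuming a config object that exposes components.linking.train as a plain attribute. The helper name ensure_not_training and the SimpleNamespace stand-in are illustrative and not part of medcat2; the real method may log a warning or differ in other ways.

```python
from types import SimpleNamespace


def ensure_not_training(config) -> bool:
    """Switch the linking train flag off if it was left on.

    Hypothetical helper: returns True if the flag had to be corrected.
    """
    linking = config.components.linking
    if linking.train:
        # Training mode should not leak into inference, so reset it.
        linking.train = False
        return True
    return False


# Stand-in for a real Config object (assumption, for illustration only).
config = SimpleNamespace(
    components=SimpleNamespace(linking=SimpleNamespace(train=True)))
corrected = ensure_not_training(config)
```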

get_entities(text: str, only_cui: Literal[False] = False) -> medcat2.data.entities.Entities
get_entities(text: str, only_cui: Literal[True] = True) -> medcat2.data.entities.OnlyCUIEntities
get_entities(text: str, only_cui: bool = False) -> dict | medcat2.data.entities.Entities | medcat2.data.entities.OnlyCUIEntities

Get the entities recognised and linked within the provided text.

This will run the text through the pipeline and annotate the recognised and linked entities.

Parameters:
  • text (str) – The text to use.

  • only_cui (bool, optional) – Whether to only output the CUIs rather than the entire context. Defaults to False.

Returns:

Union[dict, Entities, OnlyCUIEntities] – The entities found and linked within the text.
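
A hedged usage sketch for get_entities with only_cui=True. The helper extract_cuis is hypothetical, and it assumes the returned mapping has an "entities" key whose values are CUI strings; check the OnlyCUIEntities definition before relying on that layout.

```python
def extract_cuis(cat, text: str) -> list[str]:
    """Return the CUIs linked in `text`, given a CAT-like object."""
    result = cat.get_entities(text, only_cui=True)
    # Assumption: the result looks like {"entities": {entity_id: cui}, ...}.
    return list(result["entities"].values())
```

The function accepts any object with a compatible get_entities method, so it can be exercised with a small stub before a real model pack is loaded.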

_mp_worker_func(texts_and_indices)
Parameters:

texts_and_indices (list[tuple[str, str, bool]])

Return type:

list[tuple[str, str, Union[dict, medcat2.data.entities.Entities, medcat2.data.entities.OnlyCUIEntities]]]

_generate_batches_by_char_length(text_iter, batch_size_chars, only_cui)
Parameters:
  • text_iter (Union[Iterator[str], Iterator[tuple[str, str]]])

  • batch_size_chars (int)

  • only_cui (bool)

Return type:

Iterator[list[tuple[str, str, bool]]]

_generate_batches(text_iter, batch_size, batch_size_chars, only_cui)
Parameters:
  • text_iter (Union[Iterator[str], Iterator[tuple[str, str]]])

  • batch_size (int)

  • batch_size_chars (int)

  • only_cui (bool)

Return type:

Iterator[list[tuple[str, str, bool]]]

_generate_simple_batches(text_iter, batch_size, only_cui)
Parameters:
  • text_iter (Union[Iterator[str], Iterator[tuple[str, str]]])

  • batch_size (int)

  • only_cui (bool)

Return type:

Iterator[list[tuple[str, str, bool]]]

_mp_one_batch_per_process(executor, batch_iter, external_processes)
Parameters:
  • executor (concurrent.futures.ProcessPoolExecutor)

  • batch_iter (Iterator[list[tuple[str, str, bool]]])

  • external_processes (int)

Return type:

Iterator[tuple[str, Union[dict, medcat2.data.entities.Entities, medcat2.data.entities.OnlyCUIEntities]]]

get_entities_multi_texts(texts, only_cui=False, n_process=1, batch_size=-1, batch_size_chars=1000000)

Get entities from multiple texts (potentially in parallel).

If n_process > 1, n_process - 1 new processes will be created and data will be processed on those as well as the main process in parallel.

Parameters:
  • texts (Union[Iterable[str], Iterable[tuple[str, str]]]) – The input text. Either an iterable of raw texts or an iterable of (text_index, text) tuples.

  • only_cui (bool) – Whether to only return CUIs rather than other information like start/end and annotated value. Defaults to False.

  • n_process (int) – Number of processes to use. Defaults to 1.

  • batch_size (int) – The number of texts to batch at a time. A batch of the specified size will be given to each worker process. Defaults to -1, in which case the character count is used instead.

  • batch_size_chars (int) – The maximum number of characters to process in a batch. Each process will be given a batch of texts with a total number of characters not exceeding this value. Defaults to 1,000,000 characters. Set to -1 to disable.

Yields:

Iterator[tuple[str, Union[dict, Entities, OnlyCUIEntities]]] – The results in the format of (text_index, entities).

Return type:

Iterator[tuple[str, Union[dict, medcat2.data.entities.Entities, medcat2.data.entities.OnlyCUIEntities]]]
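
The character-based batching controlled by batch_size_chars can be pictured with the standalone sketch below. It is an illustration under stated assumptions, not the library's implementation: batch_by_char_length is a hypothetical name, and letting a single over-long text form its own batch is a choice made here.

```python
from typing import Iterable, Iterator


def batch_by_char_length(
    texts: Iterable[tuple[str, str]], max_chars: int
) -> Iterator[list[tuple[str, str]]]:
    """Group (text_index, text) pairs so each batch stays within max_chars."""
    batch: list[tuple[str, str]] = []
    batch_chars = 0
    for index, text in texts:
        # Close the current batch if adding this text would exceed the limit.
        if batch and batch_chars + len(text) > max_chars:
            yield batch
            batch, batch_chars = [], 0
        # A text longer than max_chars still gets a batch of its own.
        batch.append((index, text))
        batch_chars += len(text)
    if batch:
        yield batch


batches = list(batch_by_char_length(
    [("0", "aaaa"), ("1", "bb"), ("2", "ccc")], max_chars=6))
```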

_get_entity(ent, doc_tokens, cui)
Return type:

medcat2.data.entities.Entity

_doc_to_out_entity(ent, doc_tokens, only_cui)
Return type:

tuple[int, Union[medcat2.data.entities.Entity, str]]

_doc_to_out(doc, only_cui, out_with_text=False)
Return type:

Union[medcat2.data.entities.Entities, medcat2.data.entities.OnlyCUIEntities]

property trainer

The trainer object.

save_model_pack(target_folder, pack_name=DEFAULT_PACK_NAME, serialiser_type='dill', make_archive=True, only_archive=False, add_hash_to_pack_name=True, change_description=None)

Save model pack.

The resulting model pack name will have the hash of the model pack in its name if (and only if) the default model pack name is used.

Parameters:
  • target_folder (str) – The folder to save the pack in.

  • pack_name (str, optional) – The model pack name. Defaults to DEFAULT_PACK_NAME.

  • serialiser_type (Union[str, AvailableSerialisers], optional) – The serialiser type. Defaults to ‘dill’.

  • make_archive (bool) – Whether to make the archive (.zip file). Defaults to True.

  • only_archive (bool) – Whether to clear the non-compressed folder. Defaults to False.

  • add_hash_to_pack_name (bool) – Whether to add the hash to the pack name. This is only relevant if pack_name is specified. Defaults to True.

  • change_description (Optional[str]) – If provided, this description will be added to the model description. Defaults to None.

Returns:

str – The final model pack path.

Return type:

str

_get_hash()
Return type:

str

_versioning(change_description)
Parameters:

change_description (Optional[str])

Return type:

str

classmethod load_model_pack(model_pack_path)

Load the model pack from file.

Parameters:

model_pack_path (str) – The model pack path.

Raises:

ValueError – If the saved data does not represent a model pack.

Returns:

CAT – The loaded model pack.

Return type:

CAT

get_model_card(as_dict: Literal[True]) -> medcat2.data.model_card.ModelCard
get_model_card(as_dict: Literal[False]) -> str

Get the model card either as a (nested) dict or as a JSON string.

Parameters:

as_dict (bool) – Whether to return as dict. Defaults to False.

Returns:

Union[str, ModelCard] – The model card.

__eq__(other)
Parameters:

other (Any)

Return type:

bool

add_addon(addon)
Parameters:

addon (medcat2.components.addons.addons.AddonComponent)

Return type:

None

get_strategy()
Return type:

SerialisingStrategy

classmethod include_properties()
Return type:

list[str]

class medcat2.utils.legacy.conversion_all.CoreComponentType

Bases: enum.Enum

Generic enumeration.

Derive from this class to define new enumerations.

tagging
token_normalizing
ner
linking
__new__(value)
_generate_next_value_(start, count, last_values)

Generate the next value when not given.

  • name: the name of the member

  • start: the initial start value or None

  • count: the number of existing members

  • last_value: the last value assigned or None

classmethod _missing_(value)
__repr__()
__str__()
__dir__()

Returns all members and all public methods

__format__(format_spec)

Returns format using actual value type unless __str__ has been overridden.

__hash__()
__reduce_ex__(proto)
name()

The name of the Enum member.

value()

The value of the Enum member.

class medcat2.utils.legacy.conversion_all.AvailableSerialisers

Bases: enum.Enum

Describes the available serialisers.

dill
json
write_to(file_path)
Parameters:

file_path (str)

Return type:

None

classmethod from_file(file_path)
Parameters:

file_path (str)

Return type:

AvailableSerialisers

__new__(value)
_generate_next_value_(start, count, last_values)

Generate the next value when not given.

  • name: the name of the member

  • start: the initial start value or None

  • count: the number of existing members

  • last_value: the last value assigned or None

classmethod _missing_(value)
__repr__()
__str__()
__dir__()

Returns all members and all public methods

__format__(format_spec)

Returns format using actual value type unless __str__ has been overridden.

__hash__()
__reduce_ex__(proto)
name()

The name of the Enum member.

value()

The value of the Enum member.

class medcat2.utils.legacy.conversion_all.NoActionLinker

Bases: medcat2.components.types.AbstractCoreComponent

Base class for protocol classes.

Protocol classes are defined as:

class Proto(Protocol):
    def meth(self) -> int:
        ...

Such classes are primarily used with static type checkers that recognize structural subtyping (static duck-typing), for example:

class C:
    def meth(self) -> int:
        return 0

def func(x: Proto) -> int:
    return x.meth()

func(C())  # Passes static type check

See PEP 544 for details. Protocol classes decorated with @typing.runtime_checkable act as simple-minded runtime protocols that check only the presence of given attributes, ignoring their type signatures. Protocol classes can be generic, they are defined as:

class GenProto(Protocol[T]):
    def meth(self) -> T:
        ...
name = 'no_action'

The name of the component.

get_type()
__call__(doc)
Parameters:

doc (medcat2.tokenizing.tokens.MutableDocument)

Return type:

medcat2.tokenizing.tokens.MutableDocument
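
The reference does not describe what NoActionLinker does beyond its name and signature. As a rough illustration only, a component of this shape would presumably hand the document back unchanged; the class NoOpLinker below is a hypothetical stand-in and not the library's class.

```python
class NoOpLinker:
    """Hypothetical stand-in for a linker that performs no linking."""

    name = "no_action"

    def __call__(self, doc):
        # Assumption: the document passes through without modification.
        return doc
```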

classmethod get_init_args(tokenizer, cdb, vocab, model_load_path)

Get the init arguments for the component.

Parameters:
  • tokenizer (BaseTokenizer) – The tokenizer.

  • cdb (CDB) – The CDB.

  • vocab (Vocab) – The Vocab.

  • model_load_path (Optional[str]) – The model load path (or None).

Returns:

list[Any] – The list of init arguments.

Return type:

list[Any]

classmethod get_init_kwargs(tokenizer, cdb, vocab, model_load_path)

Get init keyword arguments for the component.

Parameters:
  • tokenizer (BaseTokenizer) – The tokenizer.

  • cdb (CDB) – The CDB.

  • vocab (Vocab) – The Vocab.

  • model_load_path (Optional[str]) – The model load path (or None).

Returns:

dict[str, Any] – The keyword arguments.

Return type:

dict[str, Any]

NAME_PREFIX = 'core_'
property full_name: str

Name with the component type (e.g. ner, linking, meta).

Return type:

str

is_core()

Whether the component is a core component or not.

Returns:

bool – Whether this is a core component.

Return type:

bool

__slots__ = ()
_is_protocol = True
_is_runtime_protocol = False
classmethod __init_subclass__(*args, **kwargs)
classmethod __class_getitem__(params)
medcat2.utils.legacy.conversion_all.get_cdb_from_old(old_path)

Get the v2 CDB from a v1 CDB path.

Parameters:

old_path (str) – The v1 CDB path.

Returns:

CDB – The v2 CDB.

Return type:

medcat2.cdb.CDB

medcat2.utils.legacy.conversion_all.get_config_from_old(path)

Convert the saved v1 config into a v2 Config.

Parameters:

path (str) – The v1 config path.

Returns:

Config – The v2 config.

Return type:

medcat2.config.Config

medcat2.utils.legacy.conversion_all.get_vocab_from_old(old_path)

Convert a v1 vocab file to a v2 Vocab.

Parameters:

old_path (str) – The v1 vocab file path.

Returns:

Vocab – The v2 Vocab.

Return type:

medcat2.vocab.Vocab

medcat2.utils.legacy.conversion_all.get_meta_cat_from_old(old_path, tokenizer)

Convert a v1 MetaCAT folder to a v2 MetaCAT.

Parameters:
  • old_path (str) – The v1 MetaCAT file path.

  • tokenizer (BaseTokenizer) – The tokenizer.

Returns:

MetaCATAddon – The v2 MetaCAT.

Return type:

medcat2.components.addons.meta_cat.meta_cat.MetaCATAddon

medcat2.utils.legacy.conversion_all.get_rel_cat_from_old(cdb, old_path, tokenizer)

Convert a v1 RelCAT folder to a v2 RelCAT.

Parameters:
  • cdb (CDB) – The base CDB.

  • old_path (str) – The v1 RelCAT file path.

  • tokenizer (BaseTokenizer) – The tokenizer.

Returns:

RelCATAddon – The v2 RelCAT.

Return type:

medcat2.components.addons.relation_extraction.rel_cat.RelCATAddon

medcat2.utils.legacy.conversion_all.get_trf_ner_from_old(old_path, tokenizer)
Return type:

medcat2.components.ner.trf.transformers_ner.TransformersNER

medcat2.utils.legacy.conversion_all.fix_subnames(cat)
Parameters:

cat (medcat2.cat.CAT)

Return type:

None

medcat2.utils.legacy.conversion_all.logger
class medcat2.utils.legacy.conversion_all.Converter(medcat1_model_pack_path, new_model_pack_path, ser_type=AvailableSerialisers.dill)

Converts v1 models to v2 models.

cdb_name = 'cdb.dat'
vocab_name = 'vocab.dat'
config_name = 'config.json'
__init__(medcat1_model_pack_path, new_model_pack_path, ser_type=AvailableSerialisers.dill)
old_model_folder
new_model_folder
ser_type
property expected_files_in_folder

The base names of the required files in a folder for a v1 model.
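
A small sketch of how the required v1 files could be checked, assuming the three base names listed as class attributes above (cdb.dat, vocab.dat, config.json). The function missing_v1_files is hypothetical; the converter's own _validate() may check more or behave differently.

```python
import os

# Base names taken from the Converter class attributes listed above.
EXPECTED_FILES = ("cdb.dat", "vocab.dat", "config.json")


def missing_v1_files(folder: str) -> list[str]:
    """Return the expected v1 file names that are absent from `folder`."""
    present = set(os.listdir(folder))
    return [name for name in EXPECTED_FILES if name not in present]
```

An empty list means the folder at least looks like an unpacked v1 model pack.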

_validate()
convert()

Use the gathered information to convert to a v2 model.

This converts the CDB, Vocab, and Config, in that order, and then creates the model pack.

If self.new_model_folder is set, the model will be saved as well.

Returns:

CAT – The model pack.

Return type:

medcat2.cat.CAT
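
A hedged usage sketch built only from the constructor and convert() signatures shown in this reference. The wrapper convert_v1_pack is hypothetical, and the default serialiser is left untouched.

```python
def convert_v1_pack(old_pack_path: str, new_pack_path: str):
    """Convert a v1 model pack and return the resulting v2 CAT."""
    # Imported lazily so this snippet loads even without medcat2 installed.
    from medcat2.utils.legacy.conversion_all import Converter

    converter = Converter(old_pack_path, new_pack_path)
    # Per the description above, convert() also saves the model
    # when a new model folder is set.
    return converter.convert()
```

A call would look like convert_v1_pack("path/to/v1_pack", "path/to/v2_folder"), where both paths are placeholders.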

medcat2.utils.legacy.conversion_all.unpack(model_zip_path, target_folder)

Unpack v1 model into target folder.

Parameters:
  • model_zip_path (str) – ZIP path.

  • target_folder (str) – Target folder.
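
For orientation, unpacking a ZIP archive into a folder can be done with the standard library alone, as in the sketch below. This is an assumption about the general technique, not a copy of the library's unpack(); the name unpack_zip is hypothetical.

```python
import os
import shutil


def unpack_zip(model_zip_path: str, target_folder: str) -> None:
    """Extract a ZIP archive into target_folder, creating it if needed."""
    os.makedirs(target_folder, exist_ok=True)
    # format="zip" makes the archive type explicit regardless of extension.
    shutil.unpack_archive(model_zip_path, extract_dir=target_folder,
                          format="zip")
```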