medcat.utils.legacy.conversion_all
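Attributes
logger
Classes
CAT | This is a collection of serialisable model parts.
CoreComponentType | Generic enumeration.
AvailableSerialisers | Describes the available serialisers.
NoActionLinker | Base class for protocol classes.
Converter | Converts v1 models to v2 models.
Functions
get_cdb_from_old(old_path) | Get the v2 CDB from a v1 CDB path.
get_config_from_old(path) | Convert the saved v1 config into a v2 Config.
get_vocab_from_old(old_path) | Convert a v1 vocab file to a v2 Vocab.
get_meta_cat_from_old(old_path, tokenizer) | Convert a v1 MetaCAT folder to a v2 MetaCAT.
get_rel_cat_from_old(cdb, old_path, tokenizer) | Convert a v1 RelCAT folder to a v2 RelCAT.
get_trf_ner_from_old(old_path, tokenizer) |
fix_subnames(cat) |
unpack(model_zip_path, target_folder) | Unpack v1 model into target folder.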
Module Contents
- class medcat.utils.legacy.conversion_all.CAT(cdb, vocab=None, config=None, model_load_path=None)
Bases:
medcat.storage.serialisables.AbstractSerialisable
This is a collection of serialisable model parts.
- Parameters:
cdb (medcat.cdb.CDB)
vocab (Union[medcat.vocab.Vocab, None])
config (Optional[medcat.config.config.Config])
model_load_path (Optional[str])
- __init__(cdb, vocab=None, config=None, model_load_path=None)
- Parameters:
cdb (medcat.cdb.CDB)
vocab (Union[medcat.vocab.Vocab, None])
config (Optional[medcat.config.config.Config])
model_load_path (Optional[str])
- Return type:
None
- cdb
- vocab = None
- config = None
- _trainer: medcat.trainer.Trainer | None = None
- _pipeline
- usage_monitor
- _recreate_pipe(model_load_path=None)
- Parameters:
model_load_path (Optional[str])
- Return type:
- classmethod get_init_attrs()
- Return type:
list[str]
- classmethod ignore_attrs()
- Return type:
list[str]
- __call__(text)
- Parameters:
text (str)
- Return type:
Optional[medcat.tokenizing.tokens.MutableDocument]
- _ensure_not_training()
Method to ensure config is not set to train.
config.components.linking.train should only be True while training and not during inference. This also corrects the setting if necessary.
- Return type:
None
- get_entities(text: str, only_cui: Literal[False] = False) -> medcat.data.entities.Entities
- get_entities(text: str, only_cui: Literal[True] = True) -> medcat.data.entities.OnlyCUIEntities
- get_entities(text: str, only_cui: bool = False) -> dict | medcat.data.entities.Entities | medcat.data.entities.OnlyCUIEntities
Get the entities recognised and linked within the provided text.
This will run the text through the pipeline and annotate the recognised and linked entities.
- Parameters:
text (str) – The text to use.
only_cui (bool, optional) – Whether to only output the CUIs rather than the entire context. Defaults to False.
- Returns:
Union[dict, Entities, OnlyCUIEntities] – The entities found and linked within the text.
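A minimal usage sketch for the only_cui flag, assuming `cat` is an already loaded model pack. The helper name `cuis_in_text` is hypothetical, and the assumption that the result is dict-like with an "entities" mapping is not confirmed by this page, so check the shape returned by your version.

```python
from typing import Any


def cuis_in_text(cat: Any, text: str) -> list[str]:
    """Return the CUIs that `cat.get_entities` reports for `text`.

    `cat` is any object with a `get_entities(text, only_cui=...)` method,
    for example a loaded model pack.
    """
    result = cat.get_entities(text, only_cui=True)
    # Assumption: the result is a dict with an "entities" mapping of
    # entity id -> CUI. Anything else is treated as "no entities".
    entities = result.get("entities", {}) if isinstance(result, dict) else {}
    return list(entities.values())
```

Passing only_cui=False instead would return the fuller per-entity context described above rather than bare CUIs.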
- _mp_worker_func(texts_and_indices)
- Parameters:
texts_and_indices (list[tuple[str, str, bool]])
- Return type:
list[tuple[str, str, Union[dict, medcat.data.entities.Entities, medcat.data.entities.OnlyCUIEntities]]]
- _generate_batches_by_char_length(text_iter, batch_size_chars, only_cui)
- Parameters:
text_iter (Union[Iterator[str], Iterator[tuple[str, str]]])
batch_size_chars (int)
only_cui (bool)
- Return type:
Iterator[list[tuple[str, str, bool]]]
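The idea of batching by character count can be sketched as follows. This is not the library's implementation: the function name `batch_by_char_length` is hypothetical, the only_cui flag carried in the real batch tuples is left out, and letting an oversized single text form its own batch is an assumption.

```python
from typing import Iterable, Iterator


def batch_by_char_length(
    texts: Iterable[tuple[str, str]], batch_size_chars: int
) -> Iterator[list[tuple[str, str]]]:
    """Group (text_index, text) pairs so each batch stays within a character budget."""
    batch: list[tuple[str, str]] = []
    chars_in_batch = 0
    for text_index, text in texts:
        # Close the current batch if adding this text would exceed the budget.
        if batch and chars_in_batch + len(text) > batch_size_chars:
            yield batch
            batch, chars_in_batch = [], 0
        # Assumption: a text longer than the budget still gets a batch of its own.
        batch.append((text_index, text))
        chars_in_batch += len(text)
    if batch:
        yield batch
```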
- _generate_batches(text_iter, batch_size, batch_size_chars, only_cui)
- Parameters:
text_iter (Union[Iterator[str], Iterator[tuple[str, str]]])
batch_size (int)
batch_size_chars (int)
only_cui (bool)
- Return type:
Iterator[list[tuple[str, str, bool]]]
- _generate_simple_batches(text_iter, batch_size, only_cui)
- Parameters:
text_iter (Union[Iterator[str], Iterator[tuple[str, str]]])
batch_size (int)
only_cui (bool)
- Return type:
Iterator[list[tuple[str, str, bool]]]
- _mp_one_batch_per_process(executor, batch_iter, external_processes)
- Parameters:
executor (concurrent.futures.ProcessPoolExecutor)
batch_iter (Iterator[list[tuple[str, str, bool]]])
external_processes (int)
- Return type:
Iterator[tuple[str, Union[dict, medcat.data.entities.Entities, medcat.data.entities.OnlyCUIEntities]]]
- get_entities_multi_texts(texts, only_cui=False, n_process=1, batch_size=-1, batch_size_chars=1000000)
Get entities from multiple texts (potentially in parallel).
If n_process > 1, n_process - 1 new processes will be created and data will be processed on those as well as the main process in parallel.
- Parameters:
texts (Union[Iterable[str], Iterable[tuple[str, str]]]) – The input texts. Either an iterable of raw texts or an iterable of (text_index, text) tuples.
only_cui (bool) – Whether to only return CUIs rather than other information like start/end and annotated value. Defaults to False.
n_process (int) – Number of processes to use. Defaults to 1.
batch_size (int) – The number of texts to batch at a time. A batch of the specified size will be given to each worker process. Defaults to -1 and in this case the character count will be used instead.
batch_size_chars (int) – The maximum number of characters to process in a batch. Each process will be given a batch of texts with a total number of characters not exceeding this value. Defaults to 1,000,000 characters. Set to -1 to disable.
- Yields:
Iterator[tuple[str, Union[dict, Entities, OnlyCUIEntities]]] – The results in the format of (text_index, entities).
- Return type:
Iterator[tuple[str, Union[dict, medcat.data.entities.Entities, medcat.data.entities.OnlyCUIEntities]]]
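A short usage sketch, assuming `cat` is a loaded model pack and the texts are supplied as (text_index, text) tuples. The helper name `annotate_many` and the 500,000 character budget are illustrative choices, not recommendations from the library.

```python
from typing import Any, Iterable


def annotate_many(
    cat: Any, texts: Iterable[tuple[str, str]], n_process: int = 1
) -> dict[str, Any]:
    """Collect the (text_index, entities) pairs yielded by the model pack."""
    results: dict[str, Any] = {}
    # The method yields lazily, so results arrive as batches finish.
    for text_index, entities in cat.get_entities_multi_texts(
        texts, n_process=n_process, batch_size_chars=500_000
    ):
        results[text_index] = entities
    return results
```

Because the method yields results rather than returning a list, collecting into a dict keyed by text_index keeps each output tied to its input text.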
- _get_entity(ent, doc_tokens, cui)
- Parameters:
doc_tokens (list[str])
cui (str)
- Return type:
- get_addon_output(ent)
Get the addon output for the entity.
This includes a key-value pair for each addon that provides some. Sometimes same-type addons may combine their output under the same key.
- Parameters:
ent (MutableEntity) – The entity in question.
- Raises:
ValueError – If unable to merge multiple addon output.
- Returns:
dict[str, dict] – All the addon output.
- Return type:
dict[str, dict]
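One way to picture the merging of same-key addon output is the sketch below. It is purely illustrative: the function name `merge_addon_outputs`, the input shape, and the rule that a repeated inner key counts as a conflict are all assumptions, not the library's actual logic.

```python
from typing import Any


def merge_addon_outputs(outputs: list[tuple[str, dict[str, Any]]]) -> dict[str, dict]:
    """Combine (output_key, values) pairs, merging values that share a key."""
    merged: dict[str, dict] = {}
    for key, values in outputs:
        target = merged.setdefault(key, {})
        for name, value in values.items():
            # Assumed conflict rule: the same inner name may not appear twice.
            if name in target:
                raise ValueError(
                    f"Unable to merge addon output: duplicate {name!r} under {key!r}"
                )
            target[name] = value
    return merged
```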
- _doc_to_out_entity(ent, doc_tokens, only_cui)
- Parameters:
doc_tokens (list[str])
only_cui (bool)
- Return type:
tuple[int, Union[medcat.data.entities.Entity, str]]
- _doc_to_out(doc, only_cui, out_with_text=False)
- Parameters:
only_cui (bool)
out_with_text (bool)
- Return type:
Union[medcat.data.entities.Entities, medcat.data.entities.OnlyCUIEntities]
- property trainer
The trainer object.
- save_model_pack(target_folder, pack_name=DEFAULT_PACK_NAME, serialiser_type='dill', make_archive=True, only_archive=False, add_hash_to_pack_name=True, change_description=None)
Save model pack.
The resulting model pack name will have the hash of the model pack in its name if (and only if) the default model pack name is used.
- Parameters:
target_folder (str) – The folder to save the pack in.
pack_name (str, optional) – The model pack name. Defaults to DEFAULT_PACK_NAME.
serialiser_type (Union[str, AvailableSerialisers], optional) – The serialiser type. Defaults to ‘dill’.
make_archive (bool) – Whether to make the archive (.zip file). Defaults to True.
only_archive (bool) – Whether to clear the non-compressed folder. Defaults to False.
add_hash_to_pack_name (bool) – Whether to add the hash to the pack name. This is only relevant if pack_name is specified. Defaults to True.
change_description (Optional[str]) – If provided, this description will be added to the model description. Defaults to None.
- Returns:
str – The final model pack path.
- Return type:
str
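A minimal sketch of a save call, assuming `cat` is a loaded model pack. The wrapper name `save_pack` is hypothetical; only the keyword arguments listed above are used.

```python
from typing import Any, Optional


def save_pack(
    cat: Any, target_folder: str, change_description: Optional[str] = None
) -> str:
    """Save `cat` into `target_folder` and return the final model pack path."""
    return cat.save_model_pack(
        target_folder,
        make_archive=True,    # also produce the .zip archive
        only_archive=False,   # keep the non-compressed folder as well
        change_description=change_description,
    )
```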
- _get_hash()
- Return type:
str
- _versioning(change_description)
- Parameters:
change_description (Optional[str])
- Return type:
str
- classmethod attempt_unpack(zip_path)
Attempt to unpack the zip to a folder and get the model pack path.
If the folder already exists, no unpacking is done.
- Parameters:
zip_path (str) – The ZIP path
- Returns:
str – The model pack path
- Return type:
str
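The "unpack only if the folder does not already exist" behaviour can be sketched with the standard library. This is not the library's code: the function name `unpack_if_needed` is hypothetical, and deriving the target folder by stripping the .zip extension is an assumption.

```python
import os
import shutil


def unpack_if_needed(zip_path: str) -> str:
    """Unpack `zip_path` next to itself unless the target folder already exists."""
    # Assumption: the target folder is the archive path without ".zip".
    folder, ext = os.path.splitext(zip_path)
    if ext.lower() != ".zip":
        raise ValueError(f"Expected a .zip path, got: {zip_path}")
    if not os.path.exists(folder):
        shutil.unpack_archive(zip_path, folder, format="zip")
    return folder
```

Skipping the unpack when the folder exists makes repeated loads cheap, at the cost of not noticing when the archive has changed since it was first unpacked.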
- classmethod load_model_pack(model_pack_path)
Load the model pack from file.
- Parameters:
model_pack_path (str) – The model pack path.
- Raises:
ValueError – If the saved data does not represent a model pack.
- Returns:
CAT – The loaded model pack.
- Return type:
- classmethod load_cdb(model_pack_path)
Loads the concept database from the provided model pack path.
- Parameters:
model_pack_path (str) – path to model pack, zip or dir.
- Returns:
CDB – The loaded concept database
- Return type:
- get_model_card(as_dict: Literal[True]) -> medcat.data.model_card.ModelCard
- get_model_card(as_dict: Literal[False]) -> str
Get the model card as either a (nested) dict or a JSON string.
- Parameters:
as_dict (bool) – Whether to return as dict. Defaults to False.
- Returns:
Union[str, ModelCard] – The model card.
- __eq__(other)
- Parameters:
other (Any)
- Return type:
bool
- add_addon(addon)
- Parameters:
- Return type:
None
- get_strategy()
- Return type:
- classmethod include_properties()
- Return type:
list[str]
- class medcat.utils.legacy.conversion_all.CoreComponentType
Bases:
enum.Enum
Generic enumeration.
Derive from this class to define new enumerations.
- tagging
- token_normalizing
- ner
- linking
- __new__(value)
- _generate_next_value_(start, count, last_values)
Generate the next value when not given.
name: the name of the member
start: the initial start value or None
count: the number of existing members
last_value: the last value assigned or None
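The standard-library hook described here can be seen in a small sketch. The classes `LowerNameEnum` and `ComponentKind` are hypothetical illustrations of the `enum` mechanism and are not part of this module; nothing here implies that CoreComponentType itself uses `auto()`.

```python
from enum import Enum, auto


class LowerNameEnum(Enum):
    """Base enum whose auto() values are the lower-cased member names."""

    @staticmethod
    def _generate_next_value_(name, start, count, last_values):
        # Called by the enum machinery for every member declared with auto().
        return name.lower()


class ComponentKind(LowerNameEnum):
    NER = auto()
    LINKING = auto()
```

With this base class, `ComponentKind.NER.value` is the string "ner", and a member can be looked up by value with `ComponentKind("linking")`.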
- classmethod _missing_(value)
- __repr__()
- __str__()
- __dir__()
Returns all members and all public methods
- __format__(format_spec)
Returns format using actual value type unless __str__ has been overridden.
- __hash__()
- __reduce_ex__(proto)
- name()
The name of the Enum member.
- value()
The value of the Enum member.
- class medcat.utils.legacy.conversion_all.AvailableSerialisers
Bases:
enum.Enum
Describes the available serialisers.
- dill
- json
- write_to(file_path)
- Parameters:
file_path (str)
- Return type:
None
- classmethod from_file(file_path)
- Parameters:
file_path (str)
- Return type:
- __new__(value)
- _generate_next_value_(start, count, last_values)
Generate the next value when not given.
name: the name of the member
start: the initial start value or None
count: the number of existing members
last_value: the last value assigned or None
- classmethod _missing_(value)
- __repr__()
- __str__()
- __dir__()
Returns all members and all public methods
- __format__(format_spec)
Returns format using actual value type unless __str__ has been overridden.
- __hash__()
- __reduce_ex__(proto)
- name()
The name of the Enum member.
- value()
The value of the Enum member.
- class medcat.utils.legacy.conversion_all.NoActionLinker
Bases:
medcat.components.types.AbstractCoreComponent
Base class for protocol classes.
Protocol classes are defined as:
    class Proto(Protocol):
        def meth(self) -> int:
            ...
Such classes are primarily used with static type checkers that recognize structural subtyping (static duck-typing), for example:
    class C:
        def meth(self) -> int:
            return 0

    def func(x: Proto) -> int:
        return x.meth()

    func(C())  # Passes static type check
See PEP 544 for details. Protocol classes decorated with @typing.runtime_checkable act as simple-minded runtime protocols that check only the presence of given attributes, ignoring their type signatures. Protocol classes can be generic, they are defined as:
    class GenProto(Protocol[T]):
        def meth(self) -> T:
            ...
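The structural subtyping described above can be tried at runtime with a small sketch. The names `Component`, `PassThrough`, and `run` are hypothetical and only illustrate `typing.Protocol`; they are not classes from this module.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Component(Protocol):
    """Anything with these two methods counts as a Component."""

    def get_type(self) -> str: ...

    def process(self, doc: str) -> str: ...


class PassThrough:
    """Does not inherit from Component, but matches it structurally."""

    def get_type(self) -> str:
        return "linking"

    def process(self, doc: str) -> str:
        # A no-op step: return the document unchanged.
        return doc


def run(component: Component, doc: str) -> str:
    return component.process(doc)
```

Here `isinstance(PassThrough(), Component)` is true because the runtime check only looks for the presence of the two methods, not their signatures.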
- name = 'no_action'
The name of the component.
- get_type()
- __call__(doc)
- Parameters:
- Return type:
- classmethod create_new_component(cnf, tokenizer, cdb, vocab, model_load_path)
Create a new component or load one off disk if load path presented.
This may raise an exception if the wrong type of config is provided.
- Parameters:
cnf (ComponentConfig) – The config relevant to this component.
tokenizer (BaseTokenizer) – The base tokenizer.
cdb (CDB) – The CDB.
vocab (Vocab) – The Vocab.
model_load_path (Optional[str]) – Model load path (if present).
- Returns:
Self – The new component.
- Return type:
- NAME_PREFIX = 'core_'
- property full_name: str
Name with the component type (e.g. ner, linking, meta).
- Return type:
str
- is_core()
Whether the component is a core component or not.
- Returns:
bool – Whether this is a core component.
- Return type:
bool
- __slots__ = ()
- _is_protocol = True
- _is_runtime_protocol = False
- classmethod __init_subclass__(*args, **kwargs)
- classmethod __class_getitem__(params)
- medcat.utils.legacy.conversion_all.get_cdb_from_old(old_path)
Get the v2 CDB from a v1 CDB path.
- Parameters:
old_path (str) – The v1 CDB path.
- Returns:
CDB – The v2 CDB.
- Return type:
- medcat.utils.legacy.conversion_all.get_config_from_old(path)
Convert the saved v1 config into a v2 Config.
- Parameters:
path (str) – The v1 config path.
- Returns:
Config – The v2 config.
- Return type:
- medcat.utils.legacy.conversion_all.get_vocab_from_old(old_path)
Convert a v1 vocab file to a v2 Vocab.
- Parameters:
old_path (str) – The v1 vocab file path.
- Returns:
Vocab – The v2 Vocab.
- Return type:
- medcat.utils.legacy.conversion_all.get_meta_cat_from_old(old_path, tokenizer)
Convert a v1 MetaCAT folder to a v2 MetaCAT.
- Parameters:
old_path (str) – The v1 MetaCAT file path.
tokenizer (BaseTokenizer) – The tokenizer.
- Returns:
MetaCATAddon – The v2 MetaCAT.
- Return type:
- medcat.utils.legacy.conversion_all.get_rel_cat_from_old(cdb, old_path, tokenizer)
Convert a v1 RelCAT folder to a v2 RelCAT.
- Parameters:
cdb (CDB) – The base CDB.
old_path (str) – The v1 RelCAT file path.
tokenizer (BaseTokenizer) – The tokenizer.
- Returns:
RelCATAddon – The v2 RelCAT.
- Return type:
medcat.components.addons.relation_extraction.rel_cat.RelCATAddon
- medcat.utils.legacy.conversion_all.get_trf_ner_from_old(old_path, tokenizer)
- Parameters:
old_path (str)
tokenizer (medcat.tokenizing.tokenizers.BaseTokenizer)
- Return type:
- medcat.utils.legacy.conversion_all.fix_subnames(cat)
- Parameters:
cat (medcat.cat.CAT)
- Return type:
None
- medcat.utils.legacy.conversion_all.logger
- class medcat.utils.legacy.conversion_all.Converter(medcat1_model_pack_path, new_model_pack_path, ser_type=AvailableSerialisers.dill)
Converts v1 models to v2 models.
- Parameters:
medcat1_model_pack_path (str)
new_model_pack_path (Optional[str])
- cdb_name = 'cdb.dat'
- vocab_name = 'vocab.dat'
- config_name = 'config.json'
- __init__(medcat1_model_pack_path, new_model_pack_path, ser_type=AvailableSerialisers.dill)
- Parameters:
medcat1_model_pack_path (str)
new_model_pack_path (Optional[str])
- old_model_folder
- new_model_folder
- ser_type
- property expected_files_in_folder
The base names of the required files in a folder for a v1 model.
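A sketch of how such a check might look, using the cdb_name, vocab_name, and config_name values listed above. The helper `missing_v1_files` is hypothetical, and treating all three files as required is an assumption, since this page does not list the property's contents.

```python
from pathlib import Path

# Base names taken from the class attributes documented above.
EXPECTED_FILES = ("cdb.dat", "vocab.dat", "config.json")


def missing_v1_files(folder: str) -> list[str]:
    """Return the expected base names that are not present in `folder`."""
    root = Path(folder)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty result means the folder has every file this sketch looks for.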
- _validate()
- convert()
Use the gathered information to convert to a v2 model.
This converts the CDB, Vocab, and Config, in that order, and then creates the model pack.
If self.new_model_folder is set, the model will be saved as well.
- Returns:
CAT – The model pack.
- Return type:
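A minimal conversion sketch using the constructor and convert() as documented above. The wrapper name `convert_v1_pack` and its argument names are hypothetical; the import is done lazily inside the function so the snippet can be loaded without MedCAT installed.

```python
from typing import Optional


def convert_v1_pack(v1_pack_path: str, v2_save_folder: Optional[str] = None):
    """Convert a v1 model pack and return the resulting v2 model pack."""
    # Lazy import: requires MedCAT to be installed when the function is called.
    from medcat.utils.legacy.conversion_all import Converter

    converter = Converter(v1_pack_path, v2_save_folder)
    # Per the description above, the model is also saved if a new folder is set.
    return converter.convert()
```

Passing None as the second argument should give an in-memory conversion only, since saving is described as conditional on new_model_folder being set.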
- medcat.utils.legacy.conversion_all.unpack(model_zip_path, target_folder)
Unpack v1 model into target folder.
- Parameters:
model_zip_path (str) – ZIP path.
target_folder (str) – Target folder.