medcat.components.addons.meta_cat

Submodules

Attributes

`__all__`
`_EXTRA_NAME`

Classes

`MetaCAT`	The MetaCAT class used for training 'Meta-Annotation' models, i.e. annotations of clinical concept annotations.
`MetaCATAddon`	Base/abstract addon component class.
`MetaAnnotationValue`	Typed dictionary holding the name, value, and confidence of a meta-annotation.

Functions

`ensure_optional_extras_installed`(package_name, extra_name)	Ensure that an optional dependency set is installed.
`get_meta_annotations`(entity)

Package Contents

medcat.components.addons.meta_cat.ensure_optional_extras_installed(package_name, extra_name)

Ensure that an optional dependency set is installed.

Parameters:

package_name (str) – The base package name.
extra_name (str) – The name of the extra dependency.

Raises:

MissingDependenciesError – If the extra dependency isn’t provided.
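A check of this kind can be sketched with the standard library alone. The snippet below is an illustrative stand-in, not MedCAT's implementation: `check_extra`, its `required_modules` argument, and the local `MissingDependenciesError` class are hypothetical names introduced for the example.

```python
import importlib.util


class MissingDependenciesError(ImportError):
    """Local stand-in for the error raised when an optional extra is missing."""


def check_extra(package_name, extra_name, required_modules):
    """Raise if any top-level module needed by the extra cannot be found.

    `required_modules` is a hypothetical argument; a real implementation
    would more likely read the requirement list from package metadata.
    """
    missing = [m for m in required_modules
               if importlib.util.find_spec(m) is None]
    if missing:
        raise MissingDependenciesError(
            f"{package_name}[{extra_name}] is missing: {', '.join(missing)}"
        )


# 'json' ships with Python, so this check passes silently.
check_extra("example_pkg", "example-extra", ["json"])
```

Failing early with a dedicated error type lets callers report which extra to install instead of surfacing a bare `ImportError` from deep inside the component.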

class medcat.components.addons.meta_cat.MetaCAT(tokenizer=None, embeddings=None, config=None, _model_state_dict=None)

Bases: medcat.storage.serialisables.AbstractSerialisable

The MetaCAT class used for training ‘Meta-Annotation’ models, i.e. annotations of clinical concept annotations. These are also known as properties or attributes of recognised entities in similar tools such as MetaMap and cTAKES.

This is a flexible, model-agnostic class that can learn any meta-annotation task, i.e. any multi-class classification task for recognised terms.

Parameters:

tokenizer (TokenizerWrapperBase) – The Huggingface tokenizer instance. This can be a pre-trained tokenizer instance from a BERT-style model, or one trained from scratch for the Bi-LSTM (w. attention) model that is currently used in most deployments.
embeddings (Tensor, numpy.ndarray) – Embedding matrix mapping each (sub)word input id to an n-dimensional (sub)word embedding.
config (ConfigMetaCAT) – the configuration for MetaCAT. Param descriptions available in ConfigMetaCAT docs.
_model_state_dict (Optional[dict[str, Any]])

name = 'meta_cat'

_component_lock

classmethod get_init_attrs()

Return type:: list[str]

classmethod ignore_attrs()

Return type:: list[str]

classmethod include_properties()

Return type:: list[str]

property _model_state_dict

__init__(tokenizer=None, embeddings=None, config=None, _model_state_dict=None)

Parameters:

tokenizer (Optional[medcat.components.addons.meta_cat.mctokenizers.tokenizers.TokenizerWrapperBase])
embeddings (Optional[Union[torch.Tensor, numpy.ndarray]])
config (Optional[medcat.config.config_meta_cat.ConfigMetaCAT])
_model_state_dict (Optional[dict[str, Any]])

Return type:

None

config = None

tokenizer = None

embeddings

model

_reset_tokenizer_info()

get_model(embeddings)

Get the model

Parameters:: embeddings (Optional[Tensor]) – The embedding tensor
Raises:: ValueError – If the meta model is not LSTM or BERT
Returns:: nn.Module – The module
Return type:: torch.nn.Module

get_hash()

A partial hash trying to catch differences between models.

Returns:: str – The hex hash.
Return type:: str

train_from_json(json_path, save_dir_path=None, data_oversampled=None, overwrite=False)

Train or continue training a model given a json_path containing a MedCATtrainer export. Training continues if an existing model is loaded, and starts from scratch if the model is blank/new.

Parameters:

json_path (Union[str, list]) – Path/Paths to a MedCATtrainer export containing the meta_annotations we want to train for.
save_dir_path (Optional[str]) – The path to save the model to. Required if auto_save_model is enabled (meaning the best model is saved during training). Defaults to None.
data_oversampled (Optional[list]) – In case of oversampling being performed, the data will be passed in the parameter allowing the model to be trained on original + synthetic data.
overwrite (bool) – Whether to allow overwriting the file if/when appropriate.

Returns:

dict – The resulting report.

Return type:

dict
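Because json_path accepts either a single path or a list of paths, several exports can be combined before training. The sketch below shows one plausible way to do that with the standard library; `load_exports` is a hypothetical helper, and merging by concatenating the `projects` lists is an assumption about how multiple exports are treated.

```python
import json
import os
import tempfile


def load_exports(json_path):
    """Load one or more export files and merge their 'projects' lists."""
    paths = [json_path] if isinstance(json_path, str) else list(json_path)
    merged = {"projects": []}
    for path in paths:
        with open(path, "r", encoding="utf-8") as f:
            merged["projects"].extend(json.load(f)["projects"])
    return merged


# Write two tiny placeholder exports to a temporary directory and merge them.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i in range(2):
        path = os.path.join(tmp, f"export_{i}.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"projects": [{"name": f"project_{i}", "documents": []}]}, f)
        paths.append(path)
    data = load_exports(paths)      # list of paths
    single = load_exports(paths[0])  # single path

print([p["name"] for p in data["projects"]])
```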

train_raw(data_loaded, save_dir_path=None, data_oversampled=None, overwrite=False)

Train or continue training a model given raw data. It will continue training if an existing model is loaded or start new training if the model is blank/new.

The raw data is expected in the following format:

    {
        'projects': [  # list of projects
            {
                'name': '<project_name>',
                'documents': [  # list of documents
                    {
                        'name': '<document_name>',
                        'text': '<text_of_document>',
                        'annotations': [  # list of annotations
                            {
                                'start': -1,  # start index of the annotation
                                'end': 1,  # end index of the annotation
                                'cui': 'cui',
                                'value': '<annotation_value>'
                            }
                        ]
                    }
                ]
            }
        ]
    }
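As a concrete illustration of the structure described above, the snippet below builds a minimal dictionary of that shape and walks it. All values (project and document names, the text, the CUI) are invented placeholders, and `iter_annotations` is a hypothetical helper, not part of MedCAT.

```python
raw_data = {
    "projects": [
        {
            "name": "example_project",
            "documents": [
                {
                    "name": "doc_1",
                    "text": "Patient denies chest pain.",
                    "annotations": [
                        # 'cui' is a placeholder, not a real concept identifier.
                        {"start": 15, "end": 25, "cui": "C0000000",
                         "value": "chest pain"},
                    ],
                }
            ],
        }
    ]
}


def iter_annotations(data):
    """Yield (project, document, annotation) triples from the nested dict."""
    for project in data["projects"]:
        for document in project["documents"]:
            for annotation in document["annotations"]:
                yield project, document, annotation


for _, document, ann in iter_annotations(raw_data):
    # Each span must lie inside the document text.
    assert 0 <= ann["start"] < ann["end"] <= len(document["text"])
    print(document["text"][ann["start"]:ann["end"]])  # → chest pain
```

Checking span bounds before training is a cheap way to catch exports whose offsets no longer match the document text.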

Parameters:

data_loaded (dict) – The raw data we want to train for.
save_dir_path (Optional[str]) – The path to save the model to. Required if auto_save_model is enabled (meaning the best model is saved during training). Defaults to None.
data_oversampled (Optional[list]) – If oversampling is performed, the oversampled data is passed in this parameter so that the model is trained on original + synthetic data. The expected format is:

    [
        [['text', 'of', 'the', 'document'], [index of medical entity], "label"],
        [['text', 'of', 'the', 'document'], [index of medical entity], "label"]
    ]
overwrite (bool) – Whether to allow overwriting the file if/when appropriate.

Returns:

dict – The resulting report.

Raises:

Exception – If no save path is specified, or category name not in data.
AssertionError – If no tokeniser is set
FileNotFoundError – If phase_number is set to 2 and model.dat file is not found
KeyError – If phase_number is set to 2 and model.dat file contains mismatched architecture

Return type:

dict
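The data_oversampled format described in the parameters above can be assembled by hand. The following is a minimal sketch that assumes the "index of medical entity" entry is a list of token positions within the token list; `make_sample` and the label strings are hypothetical and only serve the example.

```python
def make_sample(tokens, entity_indices, label):
    """Build one sample as [tokens, entity token indices, label]."""
    if not all(0 <= i < len(tokens) for i in entity_indices):
        raise ValueError("entity index outside the token list")
    return [list(tokens), list(entity_indices), label]


# Two synthetic samples; the labels are placeholders for real class names.
data_oversampled = [
    make_sample(["no", "evidence", "of", "pneumonia"], [3], "label_a"),
    make_sample(["history", "of", "asthma"], [2], "label_b"),
]

for tokens, indices, label in data_oversampled:
    print([tokens[i] for i in indices], label)
```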

eval(json_path)

Evaluate from json.

Parameters:

json_path (str) – The json file path

Returns:

dict – The resulting model dict

Raises:

AssertionError – If self.tokenizer is not set
Exception – If the category name does not exist

Return type:

dict

get_ents(doc)

Parameters:: doc (medcat.tokenizing.tokens.MutableDocument)
Return type:: Iterable[medcat.tokenizing.tokens.MutableEntity]

prepare_document(doc, input_ids, offset_mapping, lowercase)

Prepares document.

Parameters:

doc (Doc) – The document
input_ids (list) – Input ids
offset_mapping (list) – Offset mappings
lowercase (bool) – Whether to lowercase the text used to replace the center (entity) tokens

Returns:

tuple[dict, list] – Entity id to index mapping and Samples

Return type:

tuple[dict, list]
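The offset_mapping argument pairs each token with its character range in the text, which is what allows an entity's character span to be located among the tokens. The sketch below illustrates that lookup in isolation; `tokens_for_span` is a hypothetical helper and the offsets are written by hand rather than produced by a real tokenizer.

```python
def tokens_for_span(offset_mapping, start, end):
    """Return indices of tokens whose character range overlaps [start, end)."""
    return [
        i for i, (tok_start, tok_end) in enumerate(offset_mapping)
        if tok_start < end and tok_end > start
    ]


text = "Patient denies chest pain."
# (start, end) character offsets for: Patient, denies, chest, pain, "."
offset_mapping = [(0, 7), (8, 14), (15, 20), (21, 25), (25, 26)]

# The entity "chest pain" covers characters 15..25.
print(tokens_for_span(offset_mapping, 15, 25))  # → [2, 3]
```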

static batch_generator(stream, batch_size_chars)

Generator for batch of documents.

Parameters:

stream (Iterable[MutableDocument]) – The document stream
batch_size_chars (int) – Number of characters per batch

Yields:

list[MutableDocument] – The batch of documents.

Return type:

Iterable[list[medcat.tokenizing.tokens.MutableDocument]]
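Batching by character count rather than by document count keeps batches of roughly even size when document lengths vary. A minimal sketch of the idea on plain strings follows; `batch_by_chars` is a hypothetical stand-in, and the rule of closing a batch once the running total reaches batch_size_chars is an assumption about the behaviour.

```python
from typing import Iterable, Iterator


def batch_by_chars(texts: Iterable[str], batch_size_chars: int) -> Iterator[list[str]]:
    """Group texts into batches, closing a batch once it reaches the budget."""
    batch: list[str] = []
    char_count = 0
    for text in texts:
        batch.append(text)
        char_count += len(text)
        if char_count >= batch_size_chars:
            yield batch
            batch, char_count = [], 0
    # Yield whatever is left over as a final, smaller batch.
    if batch:
        yield batch


print(list(batch_by_chars(["aaaa", "bb", "cccccc", "d"], 5)))
# → [['aaaa', 'bb'], ['cccccc'], ['d']]
```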

_set_meta_anns(doc, id2category_value)

Parameters:

doc (medcat.tokenizing.tokens.MutableDocument)
id2category_value (dict)

Return type:

medcat.tokenizing.tokens.MutableDocument

__call__(doc)

Process one document, used in the spacy pipeline for sequential document processing.

Parameters:: doc (Doc) – A spacy document
Returns:: Doc – The same spacy document.
Return type:: medcat.tokenizing.tokens.MutableDocument

get_model_card(as_dict=False)

A minimal model card.

Parameters:: as_dict (bool) – Return the model card as a dictionary instead of a str. Defaults to False.
Returns:: Union[str, dict] – An indented JSON object. OR A JSON object in dict form.
Return type:: Union[str, dict]
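The as_dict switch amounts to choosing between a dictionary and its indented JSON rendering. The sketch below shows that pattern with the standard library; `render_card` and the card fields are hypothetical and do not reflect the actual contents of a MetaCAT model card.

```python
import json
from typing import Union


def render_card(card: dict, as_dict: bool = False) -> Union[str, dict]:
    """Return the card itself, or an indented JSON string of it."""
    if as_dict:
        return card
    return json.dumps(card, indent=2, sort_keys=True)


# Placeholder fields, chosen only for the example.
card = {"task": "example_task", "classes": ["a", "b"]}

print(render_card(card))                 # indented JSON text
print(render_card(card, as_dict=True))   # plain dict
```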

__repr__()

Prints the model_card for this MetaCAT instance.

Returns:

The ‘Model Card’ for this MetaCAT instance. This includes NER+L config and any MetaCATs.

get_strategy()

Return type:: SerialisingStrategy

__eq__(other)

Parameters:: other (Any)
Return type:: bool

class medcat.components.addons.meta_cat.MetaCATAddon(config, base_tokenizer, meta_cat)

Bases: medcat.components.addons.addons.AddonComponent

Base/abstract addon component class.

Parameters:

config (medcat.config.config_meta_cat.ConfigMetaCAT)
base_tokenizer (medcat.tokenizing.tokenizers.BaseTokenizer)
meta_cat (Optional[MetaCAT])

addon_type = 'meta_cat'

output_key = 'meta_anns'

config: medcat.config.config_meta_cat.ConfigMetaCAT

__init__(config, base_tokenizer, meta_cat)

Parameters:

config (medcat.config.config_meta_cat.ConfigMetaCAT)
base_tokenizer (medcat.tokenizing.tokenizers.BaseTokenizer)
meta_cat (Optional[MetaCAT])

Return type:

None

base_tokenizer

_mc

_name

property mc: MetaCAT

Return type:: MetaCAT

classmethod create_new(config, base_tokenizer, tknzer_preprocessor=None)

Factory method to create a new MetaCATAddon instance.

Parameters:

config (medcat.config.config_meta_cat.ConfigMetaCAT)
base_tokenizer (medcat.tokenizing.tokenizers.BaseTokenizer)
tknzer_preprocessor (TokenizerPreprocessor)

Return type:

MetaCATAddon

classmethod create_new_component(cnf, tokenizer, cdb, vocab, model_load_path)

Create a new component or load one off disk if load path presented.

This may raise an exception if the wrong type of config is provided.

Parameters:

cnf (ComponentConfig) – The config relevant to this components.
tokenizer (BaseTokenizer) – The base tokenizer.
cdb (CDB) – The CDB.
vocab (Vocab) – The Vocab.
model_load_path (Optional[str]) – Model load path (if present).

Returns:

Self – The new components.

Return type:

MetaCATAddon

classmethod load_existing(cnf, base_tokenizer, load_path)

Factory method to load an existing MetaCATAddon from disk.

Parameters:

cnf (medcat.config.config_meta_cat.ConfigMetaCAT)
base_tokenizer (medcat.tokenizing.tokenizers.BaseTokenizer)
load_path (str)

Return type:

MetaCATAddon

property name: str

The name of the component.

Return type:: str

__call__(doc)

Parameters:: doc (medcat.tokenizing.tokens.MutableDocument)
Return type:: medcat.tokenizing.tokens.MutableDocument

load(folder_path)

Parameters:: folder_path (str)
Return type:: MetaCAT

classmethod _load_tokenizer(config, tokenizer_folder)

Parameters:

config (medcat.config.config_meta_cat.ConfigMetaCAT)
tokenizer_folder (str)

Return type:

Optional[medcat.components.addons.meta_cat.mctokenizers.tokenizers.TokenizerWrapperBase]

classmethod _get_meta_cat_and_tokenizer_paths(folder_path)

Parameters:: folder_path (str)
Return type:: tuple[str, str]

save(folder_path)

Parameters:: folder_path (str)
Return type:: None

_init_data_paths()

property include_in_output: bool

Return type:: bool

get_output_key_val(ent)

Parameters:: ent (medcat.tokenizing.tokens.MutableEntity)
Return type:: tuple[str, dict[str, MetaAnnotationValue]]
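The returned pair is an output key together with a mapping from meta-annotation task to its value. One plausible use is merging the pair into a per-entity output dictionary, as sketched below; `add_addon_output`, the entity fields, and the task name are hypothetical, while the key `'meta_anns'` is the documented output_key.

```python
OUTPUT_KEY = "meta_anns"  # documented value of MetaCATAddon.output_key


def add_addon_output(entity_output, key_val):
    """Store the addon's (key, value) pair in a copy of the entity's output."""
    key, value = key_val
    updated = dict(entity_output)
    updated[key] = value
    return updated


# Placeholder entity output and addon result.
entity_output = {"cui": "C0000000", "source_value": "chest pain"}
key_val = (
    OUTPUT_KEY,
    {"example_task": {"name": "example_task", "value": "label_a", "confidence": 0.9}},
)

print(add_addon_output(entity_output, key_val))
```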

serialise_to(folder_path)

Parameters:: folder_path (str)
Return type:: None

classmethod deserialise_from(folder_path, **init_kwargs)

Parameters:: folder_path (str)
Return type:: MetaCATAddon

get_strategy()

Return type:: medcat.storage.serialisables.SerialisingStrategy

classmethod get_init_attrs()

Return type:: list[str]

classmethod ignore_attrs()

Return type:: list[str]

classmethod include_properties()

Return type:: list[str]

get_hash()

Return type:: str

NAME_PREFIX: str = 'addon_'

NAME_SPLITTER: str = '.'

is_core()

Whether the component is a core component or not.

Returns:: bool – Whether this is a core component.
Return type:: bool

classmethod get_folder_name_for_addon_and_name(addon_type, name)

Parameters:

addon_type (str)
name (str)

Return type:

str
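The NAME_PREFIX and NAME_SPLITTER constants suggest how an addon's folder name is composed. The sketch below assumes the order prefix, addon type, splitter, name; that order is an assumption, and both helper functions are hypothetical.

```python
NAME_PREFIX = "addon_"   # documented class attribute value
NAME_SPLITTER = "."      # documented class attribute value


def folder_name_for(addon_type, name):
    """Compose a folder name (assumed order: prefix, type, splitter, name)."""
    return f"{NAME_PREFIX}{addon_type}{NAME_SPLITTER}{name}"


def parse_folder_name(folder_name):
    """Split a folder name back into (addon_type, name)."""
    if not folder_name.startswith(NAME_PREFIX):
        raise ValueError(f"not an addon folder: {folder_name!r}")
    addon_type, name = folder_name[len(NAME_PREFIX):].split(NAME_SPLITTER, 1)
    return addon_type, name


folder = folder_name_for("meta_cat", "example_task")
print(folder)                     # → addon_meta_cat.example_task
print(parse_folder_name(folder))  # → ('meta_cat', 'example_task')
```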

get_folder_name()

Return type:: str

property full_name: str

Name with the component type (e.g. ner, linking, meta).

Return type:: str

__slots__ = ()

_is_protocol = True

_is_runtime_protocol = False

classmethod __init_subclass__(*args, **kwargs)

classmethod __class_getitem__(params)

medcat.components.addons.meta_cat.get_meta_annotations(entity)

Parameters:: entity (medcat.tokenizing.tokens.MutableEntity)
Return type:: dict[str, MetaAnnotationValue]

class medcat.components.addons.meta_cat.MetaAnnotationValue

Bases: TypedDict

A typed dictionary describing a single meta-annotation. It has the following keys:

name: str

value: str

confidence: float
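Since every value carries a confidence, a common downstream step is to keep only predictions above a threshold. The sketch below mirrors the three documented fields in a local TypedDict so that it runs without MedCAT installed; `confident_values`, the task names, and the labels are hypothetical.

```python
from typing import TypedDict


class MetaAnnotationValue(TypedDict):
    """Local mirror of the documented fields, for illustration only."""
    name: str
    value: str
    confidence: float


def confident_values(meta_anns: dict[str, MetaAnnotationValue],
                     threshold: float) -> dict[str, str]:
    """Map each task to its predicted value, keeping confident ones only."""
    return {
        task: ann["value"]
        for task, ann in meta_anns.items()
        if ann["confidence"] >= threshold
    }


# Placeholder predictions for two tasks.
meta_anns: dict[str, MetaAnnotationValue] = {
    "task_a": {"name": "task_a", "value": "label_x", "confidence": 0.97},
    "task_b": {"name": "task_b", "value": "label_y", "confidence": 0.55},
}

print(confident_values(meta_anns, 0.8))  # → {'task_a': 'label_x'}
```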

__contains__(): True if the dictionary has the specified key, else False.

__delattr__(): Implement delattr(self, name).

__delitem__(): Delete self[key].

__dir__(): Default dir() implementation.

__eq__(): Return self==value.

__format__(): Default object formatter.

__ge__(): Return self>=value.

__getattribute__(): Return getattr(self, name).

__getitem__(): x.__getitem__(y) <==> x[y]

__gt__(): Return self>value.

__init__(): Initialize self. See help(type(self)) for accurate signature.

__ior__(): Return self|=value.

__iter__(): Implement iter(self).

__le__(): Return self<=value.

__len__(): Return len(self).

__lt__(): Return self<value.

__ne__(): Return self!=value.

__new__(): Create and return a new object. See help(type) for accurate signature.

__or__(): Return self|value.

__reduce__(): Helper for pickle.

__reduce_ex__(): Helper for pickle.

__repr__(): Return repr(self).

__reversed__(): Return a reverse iterator over the dict keys.

__ror__(): Return value|self.

__setattr__(): Implement setattr(self, name, value).

__setitem__(): Set self[key] to value.

__sizeof__(): D.__sizeof__() -> size of D in memory, in bytes

__str__(): Return str(self).

__subclasshook__()

Abstract classes can override this to customize issubclass().

This is invoked early on by abc.ABCMeta.__subclasscheck__(). It should return True, False or NotImplemented. If it returns NotImplemented, the normal algorithm is used. Otherwise, it overrides the normal algorithm (and the outcome is cached).

clear(): D.clear() -> None. Remove all items from D.

copy(): D.copy() -> a shallow copy of D

get(): Return the value for key if key is in the dictionary, else default.

items(): D.items() -> a set-like object providing a view on D’s items

keys(): D.keys() -> a set-like object providing a view on D’s keys

pop()

D.pop(k[,d]) -> v, remove specified key and return the corresponding value.

If the key is not found, return the default if given; otherwise, raise a KeyError.

popitem()

Remove and return a (key, value) pair as a 2-tuple.

Pairs are returned in LIFO (last-in, first-out) order. Raises KeyError if the dict is empty.

setdefault()

Insert key with a value of default if key is not in the dictionary.

Return the value for key if key is in the dictionary, else default.

update(): D.update([E, ]**F) -> None. Update D from dict/iterable E and F. If E is present and has a .keys() method, then does: for k in E: D[k] = E[k] If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v In either case, this is followed by: for k in F: D[k] = F[k]

values(): D.values() -> an object providing a view on D’s values

medcat.components.addons.meta_cat.__all__ = ['MetaCAT', 'MetaCATAddon', 'get_meta_annotations', 'MetaAnnotationValue']

medcat.components.addons.meta_cat._EXTRA_NAME = 'meta-cat'