medcat.components.addons.meta_cat

Submodules

Attributes

__all__

_EXTRA_NAME

Classes

MetaCAT

The MetaCAT class used for training 'Meta-Annotation' models.

MetaCATAddon

Base/abstract addon component class.

MetaAnnotationValue

TypedDict holding a single meta-annotation: name, value and confidence.

Functions

ensure_optional_extras_installed(package_name, extra_name)

Ensure that an optional dependency set is installed.

get_meta_annotations(entity)

Package Contents

medcat.components.addons.meta_cat.ensure_optional_extras_installed(package_name, extra_name)

Ensure that an optional dependency set is installed.

Parameters:
  • package_name (str) – The base package name.

  • extra_name (str) – The name of the extra dependency.

Raises:

MissingDependenciesError – If the extra dependency isn’t provided.
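A check of this kind can be sketched with the standard library alone. The version below is an illustration, not MedCAT's implementation: the `required_modules` argument, the function name `ensure_modules_available` and the locally defined `MissingDependenciesError` are assumptions made so that the example is self-contained.

```python
import importlib.util


class MissingDependenciesError(Exception):
    """Local stand-in for the error named above (assumption)."""


def ensure_modules_available(package_name: str, extra_name: str,
                             required_modules: list[str]) -> None:
    """Raise if any top-level module needed by an optional extra is missing.

    `required_modules` is a hypothetical argument; the documented function
    takes only `package_name` and `extra_name`.
    """
    missing = [mod for mod in required_modules
               if importlib.util.find_spec(mod) is None]
    if missing:
        raise MissingDependenciesError(
            f"Missing {missing}; try: pip install '{package_name}[{extra_name}]'")


# Standard-library modules are always present, so this passes silently.
ensure_modules_available("medcat", "meta-cat", ["json", "hashlib"])
```

Failing early with an install hint tends to be easier to debug than an ImportError raised deep inside a training run.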

class medcat.components.addons.meta_cat.MetaCAT(tokenizer=None, embeddings=None, config=None, _model_state_dict=None)

Bases: medcat.storage.serialisables.AbstractSerialisable

The MetaCAT class used for training ‘Meta-Annotation’ models, i.e. annotations of clinical concept annotations. These are also known as properties or attributes of recognised entities in similar tools such as MetaMap and cTakes.

This is a flexible, model-agnostic class that can learn any meta-annotation task, i.e. any multi-class classification task for recognised terms.

Parameters:
  • tokenizer (TokenizerWrapperBase) –

    The Huggingface tokenizer instance. This can be a pre-trained tokenizer instance from a BERT-style model, or one trained from scratch for the Bi-LSTM (w. attention) model that is currently used in most deployments.

  • embeddings (Tensor, numpy.ndarray) – Embedding matrix mapping each (sub)word input id to an n-dim (sub)word embedding.

  • config (ConfigMetaCAT) – the configuration for MetaCAT. Param descriptions available in ConfigMetaCAT docs.

  • _model_state_dict (Optional[dict[str, Any]])

name = 'meta_cat'
_component_lock
classmethod get_init_attrs()
Return type:

list[str]

classmethod ignore_attrs()
Return type:

list[str]

classmethod include_properties()
Return type:

list[str]

property _model_state_dict
__init__(tokenizer=None, embeddings=None, config=None, _model_state_dict=None)
Return type:

None

config = None
tokenizer = None
embeddings
model
_reset_tokenizer_info()
get_model(embeddings)

Get the model

Parameters:

embeddings (Optional[Tensor]) – The embedding tensor

Raises:

ValueError – If the meta model is not LSTM or BERT

Returns:

nn.Module – The module

Return type:

torch.nn.Module

get_hash()

A partial hash trying to catch differences between models.

Returns:

str – The hex hash.

Return type:

str
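The idea of a partial hash can be illustrated without MedCAT. The sketch below is not the library's algorithm: it assumes the model is summarised by a JSON-serialisable dict (the hypothetical `summary` argument) and uses SHA-256 from the standard library.

```python
import hashlib
import json


def partial_hash(summary: dict) -> str:
    """Return a hex digest for a JSON-serialisable summary of a model.

    Keys are sorted so that dict ordering does not change the digest.
    """
    payload = json.dumps(summary, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Hypothetical summaries; the field names are illustrative only.
a = partial_hash({"model_name": "lstm", "category_name": "Status"})
b = partial_hash({"category_name": "Status", "model_name": "lstm"})
c = partial_hash({"model_name": "bert", "category_name": "Status"})
```

Here `a` and `b` match because only key order differs, while `c` differs because a value changed. A hash of this kind is "partial" in the sense that it only detects differences in whatever is included in the summary.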

train_from_json(json_path, save_dir_path=None, data_oversampled=None, overwrite=False)

Train or continue training a model given a json_path containing a MedCATtrainer export. Training continues if an existing model is loaded, or starts from scratch if the model is blank/new.

Parameters:
  • json_path (Union[str, list]) – Path/Paths to a MedCATtrainer export containing the meta_annotations we want to train for.

  • save_dir_path (Optional[str]) – Directory to save the model to. A save path is required when auto_save_model is enabled (meaning the best model is saved during training). Defaults to None.

  • data_oversampled (Optional[list]) – Oversampled (synthetic) data, if oversampling was performed. When provided, the model is trained on the original plus the synthetic data.

  • overwrite (bool) – Whether to allow overwriting the file if/when appropriate.

Returns:

dict – The resulting report.

Return type:

dict

train_raw(data_loaded, save_dir_path=None, data_oversampled=None, overwrite=False)

Train or continue training a model given raw data. It will continue training if an existing model is loaded or start new training if the model is blank/new.

The raw data is expected in the following format:

    {
        'projects': [  # list of projects
            {
                'name': '<project_name>',
                'documents': [  # list of documents
                    {
                        'name': '<document_name>',
                        'text': '<text_of_document>',
                        'annotations': [  # list of annotations
                            {
                                'start': -1,  # start index of the annotation
                                'end': 1,  # end index of the annotation
                                'cui': 'cui',
                                'value': '<annotation_value>'
                            },
                        ],
                    },
                ],
            },
        ]
    }
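As an illustration of the structure above, the following sketch builds a minimal raw-data dict and walks it. The helper `iter_annotations` is hypothetical and not part of MedCAT, and the project, document and annotation contents are placeholders; only the keys come from the format description.

```python
from typing import Iterator

# A minimal dict in the documented shape, with placeholder content.
raw_data = {
    "projects": [
        {
            "name": "demo_project",
            "documents": [
                {
                    "name": "doc_1",
                    "text": "No evidence of pneumonia.",
                    "annotations": [
                        # start/end are character offsets into "text"
                        {"start": 15, "end": 24, "cui": "<cui>",
                         "value": "pneumonia"},
                    ],
                },
            ],
        },
    ]
}


def iter_annotations(data: dict) -> Iterator[tuple[str, dict, dict]]:
    """Yield (project name, document, annotation) triples."""
    for project in data["projects"]:
        for document in project["documents"]:
            for annotation in document["annotations"]:
                yield project["name"], document, annotation


for project_name, document, ann in iter_annotations(raw_data):
    span = document["text"][ann["start"]:ann["end"]]
    print(project_name, document["name"], span)
```

Checking that each `start`/`end` pair slices the expected span out of `text` is a cheap sanity check before handing the data to training.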

Parameters:
  • data_loaded (dict) – The raw data we want to train for.

  • save_dir_path (Optional[str]) – Directory to save the model to. A save path is required when auto_save_model is enabled (meaning the best model is saved during training). Defaults to None.

  • data_oversampled (Optional[list]) –

    Oversampled (synthetic) data, if oversampling was performed. When provided, the model is trained on the original plus the synthetic data. The expected format is:

        [[['text', 'of', 'the', 'document'], [index of medical entity], "label"],
         [['text', 'of', 'the', 'document'], [index of medical entity], "label"]]

  • overwrite (bool) – Whether to allow overwriting the file if/when appropriate.
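The oversampled format described for data_oversampled can be illustrated as follows. This sketch only constructs and checks the documented shape of `[tokens, [entity index], label]`; the sample contents, the label names and the `check_oversampled` helper are assumptions, not MedCAT code.

```python
def check_oversampled(samples: list) -> bool:
    """Return True if every sample looks like [tokens, [entity indices], label]."""
    for sample in samples:
        if len(sample) != 3:
            return False
        tokens, entity_index, label = sample
        if not isinstance(tokens, list) or not isinstance(entity_index, list):
            return False
        if not isinstance(label, str):
            return False
        # Every entity index must point inside the token list.
        if any(not 0 <= i < len(tokens) for i in entity_index):
            return False
    return True


# Two hypothetical samples in the documented shape.
data_oversampled = [
    [["no", "evidence", "of", "pneumonia"], [3], "Negated"],
    [["history", "of", "asthma"], [2], "Affirmed"],
]
is_valid = check_oversampled(data_oversampled)
```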

Returns:

dict – The resulting report.

Raises:
  • Exception – If no save path is specified, or category name not in data.

  • AssertionError – If no tokeniser is set

  • FileNotFoundError – If phase_number is set to 2 and model.dat file is not found

  • KeyError – If phase_number is set to 2 and model.dat file contains mismatched architecture

Return type:

dict

eval(json_path)

Evaluate from json.

Parameters:

json_path (str) – The json file path

Returns:

dict – The resulting model dict

Raises:
  • AssertionError – If self.tokenizer is not set

  • Exception – If the category name does not exist

Return type:

dict

get_ents(doc)
Parameters:

doc (medcat.tokenizing.tokens.MutableDocument)

Return type:

Iterable[medcat.tokenizing.tokens.MutableEntity]

prepare_document(doc, input_ids, offset_mapping, lowercase)

Prepares document.

Parameters:
  • doc (Doc) – The document

  • input_ids (list) – Input ids

  • offset_mapping (list) – Offset mappings

  • lowercase (bool) – Whether to lowercase the replace_center text

Returns:

tuple[dict, list] – Entity id to index mapping and Samples

Return type:

tuple[dict, list]

static batch_generator(stream, batch_size_chars)

Generator for batch of documents.

Parameters:
  • stream (Iterable[MutableDocument]) – The document stream

  • batch_size_chars (int) – Number of characters per batch

Yields:

list[MutableDocument] – The batch of documents.

Return type:

Iterable[list[medcat.tokenizing.tokens.MutableDocument]]
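Character-based batching can be sketched with plain strings standing in for documents. This is an assumption-laden illustration rather than the method's source: the function name `batch_by_chars` is hypothetical, and the choice to close a batch once its character count reaches the limit (rather than before exceeding it) is an assumption.

```python
from typing import Iterable, Iterator


def batch_by_chars(texts: Iterable[str],
                   batch_size_chars: int) -> Iterator[list[str]]:
    """Group texts into batches, closing a batch once it reaches the limit."""
    batch: list[str] = []
    char_count = 0
    for text in texts:
        batch.append(text)
        char_count += len(text)
        if char_count >= batch_size_chars:
            yield batch
            batch = []
            char_count = 0
    # Emit whatever is left over as a final, possibly smaller, batch.
    if batch:
        yield batch


batches = list(batch_by_chars(["aaaa", "bb", "cccccc", "d"], batch_size_chars=5))
# batches == [["aaaa", "bb"], ["cccccc"], ["d"]]
```

Batching by characters rather than by document count keeps the amount of text per batch roughly constant when document lengths vary widely.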

_set_meta_anns(doc, id2category_value)
Return type:

medcat.tokenizing.tokens.MutableDocument

__call__(doc)

Process one document, used in the spacy pipeline for sequential document processing.

Parameters:

doc (Doc) – A spacy document

Returns:

Doc – The same spacy document.

Return type:

medcat.tokenizing.tokens.MutableDocument

get_model_card(as_dict=False)

A minimal model card.

Parameters:

as_dict (bool) – Return the model card as a dictionary instead of a str. Defaults to False.

Returns:

Union[str, dict] – An indented JSON string, or the same content as a dict if as_dict is True.

Return type:

Union[str, dict]
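The as_dict switch follows a common pattern, sketched below with the standard library. The `render_model_card` helper and the card fields are hypothetical; the real model card contents are not specified here.

```python
import json
from typing import Union


def render_model_card(card: dict, as_dict: bool = False) -> Union[str, dict]:
    """Return the card as a dict, or as an indented JSON string by default."""
    if as_dict:
        return card
    return json.dumps(card, indent=2, sort_keys=True)


# Hypothetical card fields, for illustration only.
card = {"Category Name": "Status", "Classes": {"Affirmed": 0, "Negated": 1}}
as_text = render_model_card(card)
as_mapping = render_model_card(card, as_dict=True)
```

The string form suits logging and display, while the dict form is easier to inspect programmatically.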

__repr__()

Prints the model_card for this MetaCAT instance.

Returns:
the ‘Model Card’ for this MetaCAT instance. This includes NER+L config and any MetaCATs

get_strategy()
Return type:

SerialisingStrategy

__eq__(other)
Parameters:

other (Any)

Return type:

bool

class medcat.components.addons.meta_cat.MetaCATAddon(config, base_tokenizer, meta_cat)

Bases: medcat.components.addons.addons.AddonComponent

Base/abstract addon component class.

addon_type = 'meta_cat'
output_key = 'meta_anns'
config: medcat.config.config_meta_cat.ConfigMetaCAT
__init__(config, base_tokenizer, meta_cat)
Return type:

None

base_tokenizer
_mc
_name
property mc: MetaCAT
Return type:

MetaCAT

classmethod create_new(config, base_tokenizer, tknzer_preprocessor=None)

Factory method to create a new MetaCATAddon instance.

Return type:

MetaCATAddon

classmethod create_new_component(cnf, tokenizer, cdb, vocab, model_load_path)

Create a new component or load one off disk if load path presented.

This may raise an exception if the wrong type of config is provided.

Parameters:
  • cnf (ComponentConfig) – The config relevant to this component.

  • tokenizer (BaseTokenizer) – The base tokenizer.

  • cdb (CDB) – The CDB.

  • vocab (Vocab) – The Vocab.

  • model_load_path (Optional[str]) – Model load path (if present).

Returns:

Self – The new component.

Return type:

MetaCATAddon

classmethod load_existing(cnf, base_tokenizer, load_path)

Factory method to load an existing MetaCATAddon from disk.

Return type:

MetaCATAddon

property name: str

The name of the component.

Return type:

str

__call__(doc)
Parameters:

doc (medcat.tokenizing.tokens.MutableDocument)

Return type:

medcat.tokenizing.tokens.MutableDocument

load(folder_path)
Parameters:

folder_path (str)

Return type:

MetaCAT

classmethod _load_tokenizer(config, tokenizer_folder)
Return type:

Optional[medcat.components.addons.meta_cat.mctokenizers.tokenizers.TokenizerWrapperBase]

classmethod _get_meta_cat_and_tokenizer_paths(folder_path)
Parameters:

folder_path (str)

Return type:

tuple[str, str]

save(folder_path)
Parameters:

folder_path (str)

Return type:

None

_init_data_paths()
property include_in_output: bool
Return type:

bool

get_output_key_val(ent)
Parameters:

ent (medcat.tokenizing.tokens.MutableEntity)

Return type:

tuple[str, dict[str, MetaAnnotationValue]]

serialise_to(folder_path)
Parameters:

folder_path (str)

Return type:

None

classmethod deserialise_from(folder_path, **init_kwargs)
Parameters:

folder_path (str)

Return type:

MetaCATAddon

get_strategy()
Return type:

medcat.storage.serialisables.SerialisingStrategy

classmethod get_init_attrs()
Return type:

list[str]

classmethod ignore_attrs()
Return type:

list[str]

classmethod include_properties()
Return type:

list[str]

get_hash()
Return type:

str

NAME_PREFIX: str = 'addon_'
NAME_SPLITTER: str = '.'
is_core()

Whether the component is a core component or not.

Returns:

bool – Whether this is a core component.

Return type:

bool

classmethod get_folder_name_for_addon_and_name(addon_type, name)
Parameters:
  • addon_type (str)

  • name (str)

Return type:

str

get_folder_name()
Return type:

str

property full_name: str

Name with the component type (e.g. ner, linking, meta).

Return type:

str

__slots__ = ()
_is_protocol = True
_is_runtime_protocol = False
classmethod __init_subclass__(*args, **kwargs)
classmethod __class_getitem__(params)
medcat.components.addons.meta_cat.get_meta_annotations(entity)
Parameters:

entity (medcat.tokenizing.tokens.MutableEntity)

Return type:

dict[str, MetaAnnotationValue]

class medcat.components.addons.meta_cat.MetaAnnotationValue

Bases: TypedDict

dict() -> new empty dictionary

dict(mapping) -> new dictionary initialized from a mapping object’s (key, value) pairs

dict(iterable) -> new dictionary initialized as if via:

    d = {}
    for k, v in iterable:
        d[k] = v

dict(**kwargs) -> new dictionary initialized with the name=value pairs in the keyword argument list. For example: dict(one=1, two=2)

name: str
value: str
confidence: float
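The three fields above can be exercised in isolation. The sketch below re-declares the TypedDict locally so that it runs without MedCAT installed; the `confident_values` helper, the task names and the confidence numbers are hypothetical.

```python
from typing import TypedDict


class MetaAnnotationValue(TypedDict):
    """Local re-declaration mirroring the three documented fields."""
    name: str
    value: str
    confidence: float


def confident_values(meta_anns: dict[str, MetaAnnotationValue],
                     threshold: float) -> dict[str, str]:
    """Map each task name to its value, keeping only confident predictions."""
    return {task: ann["value"] for task, ann in meta_anns.items()
            if ann["confidence"] >= threshold}


# Hypothetical output in the shape dict[str, MetaAnnotationValue].
meta_anns: dict[str, MetaAnnotationValue] = {
    "Status": {"name": "Status", "value": "Affirmed", "confidence": 0.97},
    "Temporality": {"name": "Temporality", "value": "Past", "confidence": 0.41},
}
kept = confident_values(meta_anns, threshold=0.5)
# kept == {"Status": "Affirmed"}
```

Because a TypedDict is a plain dict at runtime, the values serialise to JSON directly.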
__contains__()

True if the dictionary has the specified key, else False.

__delattr__()

Implement delattr(self, name).

__delitem__()

Delete self[key].

__dir__()

Default dir() implementation.

__eq__()

Return self==value.

__format__()

Default object formatter.

__ge__()

Return self>=value.

__getattribute__()

Return getattr(self, name).

__getitem__()

x.__getitem__(y) <==> x[y]

__gt__()

Return self>value.

__init__()

Initialize self. See help(type(self)) for accurate signature.

__ior__()

Return self|=value.

__iter__()

Implement iter(self).

__le__()

Return self<=value.

__len__()

Return len(self).

__lt__()

Return self<value.

__ne__()

Return self!=value.

__new__()

Create and return a new object. See help(type) for accurate signature.

__or__()

Return self|value.

__reduce__()

Helper for pickle.

__reduce_ex__()

Helper for pickle.

__repr__()

Return repr(self).

__reversed__()

Return a reverse iterator over the dict keys.

__ror__()

Return value|self.

__setattr__()

Implement setattr(self, name, value).

__setitem__()

Set self[key] to value.

__sizeof__()

D.__sizeof__() -> size of D in memory, in bytes

__str__()

Return str(self).

__subclasshook__()

Abstract classes can override this to customize issubclass().

This is invoked early on by abc.ABCMeta.__subclasscheck__(). It should return True, False or NotImplemented. If it returns NotImplemented, the normal algorithm is used. Otherwise, it overrides the normal algorithm (and the outcome is cached).

clear()

D.clear() -> None. Remove all items from D.

copy()

D.copy() -> a shallow copy of D

get()

Return the value for key if key is in the dictionary, else default.

items()

D.items() -> a set-like object providing a view on D’s items

keys()

D.keys() -> a set-like object providing a view on D’s keys

pop()

D.pop(k[,d]) -> v, remove specified key and return the corresponding value.

If the key is not found, return the default if given; otherwise, raise a KeyError.

popitem()

Remove and return a (key, value) pair as a 2-tuple.

Pairs are returned in LIFO (last-in, first-out) order. Raises KeyError if the dict is empty.

setdefault()

Insert key with a value of default if key is not in the dictionary.

Return the value for key if key is in the dictionary, else default.

update()

D.update([E, ]**F) -> None. Update D from dict/iterable E and F.

If E is present and has a .keys() method, then does: for k in E: D[k] = E[k]

If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v

In either case, this is followed by: for k in F: D[k] = F[k]

values()

D.values() -> an object providing a view on D’s values

medcat.components.addons.meta_cat.__all__ = ['MetaCAT', 'MetaCATAddon', 'get_meta_annotations', 'MetaAnnotationValue']
medcat.components.addons.meta_cat._EXTRA_NAME = 'meta-cat'