medcat.components.addons.meta_cat
Submodules
Attributes
Classes
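MetaCAT – The MetaCAT class used for training 'Meta-Annotation' models, i.e. annotations of clinical concept annotations.
MetaCATAddon – Base/abstract addon component class.
MetaAnnotationValue – dict() -> new empty dictionary
Functions
get_meta_annotations(entity)
ensure_optional_extras_installed(package_name, extra_name) – Ensure that an optional dependency set is installed.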
Package Contents
- medcat.components.addons.meta_cat.ensure_optional_extras_installed(package_name, extra_name)
Ensure that an optional dependency set is installed.
- Parameters:
package_name (str) – The base package name.
extra_name (str) – The name of the extra dependency.
- Raises:
MissingDependenciesError – If the extra dependency isn’t provided.
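The idea behind such a check can be sketched with the standard library alone. This is not MedCAT's implementation: the helper names `requirements_for_extra` and `ensure_extra_installed`, and the local `MissingDependenciesError` class, are stand-ins invented for illustration. The sketch assumes that extras are declared through `extra == "..."` markers in the package metadata.

```python
import re
from importlib import metadata


class MissingDependenciesError(Exception):
    """Local stand-in for the error type named above (hypothetical definition)."""


def requirements_for_extra(requires, extra_name):
    """Return the distribution names gated behind the marker extra == "<extra_name>"."""
    # Markers may use either quote style, e.g. extra == "meta-cat" or extra == 'meta-cat'.
    marker_re = re.compile(r"extra\s*==\s*['\"]" + re.escape(extra_name) + r"['\"]")
    names = []
    for req in requires:
        spec, _, marker = req.partition(";")
        if marker_re.search(marker):
            match = re.match(r"\s*([A-Za-z0-9._-]+)", spec)
            if match:
                names.append(match.group(1))
    return names


def ensure_extra_installed(package_name, extra_name):
    """Raise if any distribution required by the extra is not installed."""
    requires = metadata.requires(package_name) or []
    missing = []
    for name in requirements_for_extra(requires, extra_name):
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    if missing:
        raise MissingDependenciesError(
            f"{package_name}[{extra_name}] is missing: {', '.join(missing)}"
        )


# Example requirement strings (made up) in the style found in package metadata.
example_requires = [
    'torch>=2.0; extra == "meta-cat"',
    "numpy>=1.0",
    "transformers; python_version >= '3.9' and extra == 'meta-cat'",
]
gated = requirements_for_extra(example_requires, "meta-cat")
```

Splitting the marker parsing from the installation check keeps the parsing step testable without any particular package being installed.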
- class medcat.components.addons.meta_cat.MetaCAT(tokenizer=None, embeddings=None, config=None, _model_state_dict=None)
Bases:
medcat.storage.serialisables.AbstractSerialisable
The MetaCAT class used for training ‘Meta-Annotation’ models, i.e. annotations of clinical concept annotations. These are also known as properties or attributes of recognised entities in similar tools such as MetaMap and cTAKES.
This is a flexible, model-agnostic class that can learn any meta-annotation task, i.e. any multi-class classification task for recognised terms.
- Parameters:
tokenizer (TokenizerWrapperBase) –
The Hugging Face tokenizer instance. This can be a pre-trained tokenizer instance from a BERT-style model, or one trained from scratch for the Bi-LSTM (with attention) model that is currently used in most deployments.
embeddings (Tensor, numpy.ndarray) – Embedding matrix mapping each (sub)word input id to an n-dimensional (sub)word embedding.
config (ConfigMetaCAT) – the configuration for MetaCAT. Param descriptions available in ConfigMetaCAT docs.
_model_state_dict (Optional[dict[str, Any]])
- name = 'meta_cat'
- _component_lock
- classmethod get_init_attrs()
- Return type:
list[str]
- classmethod ignore_attrs()
- Return type:
list[str]
- classmethod include_properties()
- Return type:
list[str]
- property _model_state_dict
- __init__(tokenizer=None, embeddings=None, config=None, _model_state_dict=None)
- Parameters:
tokenizer (Optional[medcat.components.addons.meta_cat.mctokenizers.tokenizers.TokenizerWrapperBase])
embeddings (Optional[Union[torch.Tensor, numpy.ndarray]])
config (Optional[medcat.config.config_meta_cat.ConfigMetaCAT])
_model_state_dict (Optional[dict[str, Any]])
- Return type:
None
- config = None
- tokenizer = None
- embeddings
- model
- _reset_tokenizer_info()
- get_model(embeddings)
Get the model
- Parameters:
embeddings (Optional[Tensor]) – The embedding tensor
- Raises:
ValueError – If the meta model is not LSTM or BERT
- Returns:
nn.Module – The module
- Return type:
torch.nn.Module
- get_hash()
A partial hash trying to catch differences between models.
- Returns:
str – The hex hash.
- Return type:
str
- train_from_json(json_path, save_dir_path=None, data_oversampled=None, overwrite=False)
Train or continue training a model given a json_path containing a MedCATtrainer export. It will continue training if an existing model is loaded or start new training if the model is blank/new.
- Parameters:
json_path (Union[str, list]) – Path/Paths to a MedCATtrainer export containing the meta_annotations we want to train for.
save_dir_path (Optional[str]) – The path to save the model to. Required when auto_save_model is enabled (meaning the best model found during training is saved). Defaults to None.
data_oversampled (Optional[list]) – If oversampling has been performed, the synthetic data is passed in this parameter so that the model is trained on original + synthetic data.
overwrite (bool) – Whether to allow overwriting the file if/when appropriate.
- Returns:
dict – The resulting report.
- Return type:
dict
- train_raw(data_loaded, save_dir_path=None, data_oversampled=None, overwrite=False)
Train or continue training a model given raw data. It will continue training if an existing model is loaded or start new training if the model is blank/new.
The raw data is expected in the following format:
{
    'projects': [  # list of projects
        {
            'name': '<project_name>',
            'documents': [  # list of documents
                {
                    'name': '<document_name>',
                    'text': '<text_of_document>',
                    'annotations': [  # list of annotations
                        {
                            'start': -1,  # start index of the annotation
                            'end': 1,  # end index of the annotation
                            'cui': 'cui',
                            'value': '<annotation_value>'
                        }
                    ]
                }
            ]
        }
    ]
}
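As a minimal sketch, a data set in this raw format can be built and sanity-checked in plain Python before it is handed over as data_loaded. Only the key names come from the format above. The helper `check_raw_data`, the sample text and the placeholder CUI are hypothetical, and the span check assumes that start and end are character offsets into the document text.

```python
# Build a tiny data set in the documented raw format.
# The sample text, CUI and value are made up for illustration.
text = "Patient denies chest pain."
mention = "chest pain"
raw_data = {
    "projects": [
        {
            "name": "demo_project",
            "documents": [
                {
                    "name": "doc_1",
                    "text": text,
                    "annotations": [
                        {
                            "start": text.index(mention),
                            "end": text.index(mention) + len(mention),
                            "cui": "C0000000",  # placeholder identifier
                            "value": mention,  # placeholder value
                        }
                    ],
                }
            ],
        }
    ]
}


def check_raw_data(data):
    """Hypothetical helper: count annotations and check each span lies inside its text.

    Assumes 'start' and 'end' are character offsets, which is an assumption
    of this sketch rather than something the format description states.
    """
    count = 0
    for project in data["projects"]:
        for document in project["documents"]:
            for ann in document["annotations"]:
                if not 0 <= ann["start"] < ann["end"] <= len(document["text"]):
                    raise ValueError(f"bad span in {document['name']}: {ann}")
                count += 1
    return count


n_annotations = check_raw_data(raw_data)
```

Running a check like this first surfaces malformed spans early instead of partway through training.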
- Parameters:
data_loaded (dict) – The raw data we want to train for.
save_dir_path (Optional[str]) – The path to save the model to. Required when auto_save_model is enabled (meaning the best model found during training is saved). Defaults to None.
data_oversampled (Optional[list]) –
If oversampling has been performed, the synthetic data is passed in this parameter so that the model is trained on original + synthetic data. The expected format is:
[[['text', 'of', 'the', 'document'], [index of medical entity], "label"],
 [['text', 'of', 'the', 'document'], [index of medical entity], "label"]]
overwrite (bool) – Whether to allow overwriting the file if/when appropriate.
- Returns:
dict – The resulting report.
- Raises:
Exception – If no save path is specified, or category name not in data.
AssertionError – If no tokeniser is set
FileNotFoundError – If phase_number is set to 2 and model.dat file is not found
KeyError – If phase_number is set to 2 and model.dat file contains mismatched architecture
- Return type:
dict
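The data_oversampled structure described for train_raw can be assembled as a plain list of three-element samples. The sketch below is illustrative only: `make_oversample` is a hypothetical helper, the tokens and labels are made up, and it assumes that the index list holds positions within the token list.

```python
def make_oversample(tokens, entity_indices, label):
    """Hypothetical helper building one [tokens, entity indices, label] sample."""
    for i in entity_indices:
        # Assumption of this sketch: indices refer to positions in `tokens`.
        if not 0 <= i < len(tokens):
            raise IndexError(f"entity index {i} outside token list")
    return [list(tokens), list(entity_indices), label]


# Two synthetic samples with made-up labels.
data_oversampled = [
    make_oversample(["no", "sign", "of", "fever"], [3], "Negated"),
    make_oversample(["history", "of", "asthma"], [2], "Past"),
]
```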
- eval(json_path)
Evaluate from json.
- Parameters:
json_path (str) – The json file path
- Returns:
dict – The resulting model dict
- Raises:
AssertionError – If self.tokenizer is not set
Exception – If the category name does not exist
- Return type:
dict
- get_ents(doc)
- Parameters:
- Return type:
Iterable[medcat.tokenizing.tokens.MutableEntity]
- prepare_document(doc, input_ids, offset_mapping, lowercase)
Prepares document.
- Parameters:
doc (Doc) – The document
input_ids (list) – Input ids
offset_mapping (list) – Offset mappings
lowercase (bool) – Whether to lowercase the replace_center text
- Returns:
tuple[dict, list] – Entity id to index mapping and Samples
- Return type:
tuple[dict, list]
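To illustrate how an offset mapping relates an entity's character span to token positions, here is a small standalone sketch. It is not the code used by prepare_document. The function name `token_indices_for_span` and the sample offsets are assumptions, and the offsets are written as (start, end) pairs in the style a fast tokenizer reports them.

```python
def token_indices_for_span(offset_mapping, start, end):
    """Return indices of tokens whose character range overlaps [start, end)."""
    return [
        i
        for i, (tok_start, tok_end) in enumerate(offset_mapping)
        if tok_start < end and tok_end > start
    ]


text = "no sign of fever"
# One (start, end) character offset pair per token (made-up tokenisation).
offset_mapping = [(0, 2), (3, 7), (8, 10), (11, 16)]
entity_start = text.index("fever")
entity_end = entity_start + len("fever")
entity_tokens = token_indices_for_span(offset_mapping, entity_start, entity_end)
```

An overlap test (rather than exact boundary equality) also covers entities that a subword tokenizer splits into several pieces.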
- static batch_generator(stream, batch_size_chars)
Generator for batch of documents.
- Parameters:
stream (Iterable[MutableDocument]) – The document stream
batch_size_chars (int) – Number of characters per batch
- Yields:
list[MutableDocument] – The batch of documents.
- Return type:
Iterable[list[medcat.tokenizing.tokens.MutableDocument]]
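Batching by a character budget can be sketched as follows. This is a simplified stand-in, not the library's batch_generator: plain strings take the place of documents, and `batch_by_chars` is a hypothetical name.

```python
from typing import Iterable, Iterator


def batch_by_chars(texts: Iterable[str], batch_size_chars: int) -> Iterator[list[str]]:
    """Yield lists of texts whose combined length does not exceed batch_size_chars.

    A single text longer than the budget is still yielded, in a batch of its own.
    """
    batch: list[str] = []
    used = 0
    for text in texts:
        # Start a new batch when adding this text would exceed the budget.
        if batch and used + len(text) > batch_size_chars:
            yield batch
            batch, used = [], 0
        batch.append(text)
        used += len(text)
    if batch:
        yield batch


batches = list(batch_by_chars(["aaaa", "bb", "cccccc", "d"], batch_size_chars=6))
```

Because it is a generator, the input stream is consumed lazily and only one batch is held in memory at a time.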
- _set_meta_anns(doc, id2category_value)
- Parameters:
id2category_value (dict)
- Return type:
- __call__(doc)
Process one document, used in the spacy pipeline for sequential document processing.
- Parameters:
doc (Doc) – A spacy document
- Returns:
Doc – The same spacy document.
- Return type:
- get_model_card(as_dict=False)
A minimal model card.
- Parameters:
as_dict (bool) – Return the model card as a dictionary instead of a str. Defaults to False.
- Returns:
Union[str, dict] – The model card as an indented JSON string, or as a dict if as_dict is True.
- Return type:
Union[str, dict]
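The as_dict switch follows a common pattern that can be sketched independently of MedCAT. The function `render_model_card` and the card keys below are hypothetical; the real model card contents are not specified in this section.

```python
import json


def render_model_card(card: dict, as_dict: bool = False):
    """Return the card as a dict, or as an indented JSON string by default."""
    if as_dict:
        return card
    return json.dumps(card, indent=2, sort_keys=True)


# Made-up card contents for illustration.
card = {"Category Name": "Status", "Classes": ["Affirmed", "Other"]}
as_text = render_model_card(card)
as_mapping = render_model_card(card, as_dict=True)
```

The string form suits logging and display, while the dict form is easier to inspect programmatically.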
- __repr__()
Prints the model_card for this MetaCAT instance.
- Returns:
The ‘Model Card’ for this MetaCAT instance. This includes NER+L config and any MetaCATs.
- get_strategy()
- Return type:
- __eq__(other)
- Parameters:
other (Any)
- Return type:
bool
- class medcat.components.addons.meta_cat.MetaCATAddon(config, base_tokenizer, meta_cat)
Bases:
medcat.components.addons.addons.AddonComponent
Base/abstract addon component class.
- Parameters:
base_tokenizer (medcat.tokenizing.tokenizers.BaseTokenizer)
meta_cat (Optional[MetaCAT])
- addon_type = 'meta_cat'
- output_key = 'meta_anns'
- __init__(config, base_tokenizer, meta_cat)
- Parameters:
base_tokenizer (medcat.tokenizing.tokenizers.BaseTokenizer)
meta_cat (Optional[MetaCAT])
- Return type:
None
- base_tokenizer
- _mc
- _name
- classmethod create_new(config, base_tokenizer, tknzer_preprocessor=None)
Factory method to create a new MetaCATAddon instance.
- Parameters:
base_tokenizer (medcat.tokenizing.tokenizers.BaseTokenizer)
tknzer_preprocessor (TokenizerPreprocessor)
- Return type:
- classmethod create_new_component(cnf, tokenizer, cdb, vocab, model_load_path)
Create a new component or load one off disk if load path presented.
This may raise an exception if the wrong type of config is provided.
- Parameters:
cnf (ComponentConfig) – The config relevant to this component.
tokenizer (BaseTokenizer) – The base tokenizer.
cdb (CDB) – The CDB.
vocab (Vocab) – The Vocab.
model_load_path (Optional[str]) – Model load path (if present).
- Returns:
Self – The new component.
- Return type:
- classmethod load_existing(cnf, base_tokenizer, load_path)
Factory method to load an existing MetaCATAddon from disk.
- Parameters:
base_tokenizer (medcat.tokenizing.tokenizers.BaseTokenizer)
load_path (str)
- Return type:
- property name: str
The name of the component.
- Return type:
str
- __call__(doc)
- Parameters:
- Return type:
- classmethod _load_tokenizer(config, tokenizer_folder)
- Parameters:
tokenizer_folder (str)
- Return type:
Optional[medcat.components.addons.meta_cat.mctokenizers.tokenizers.TokenizerWrapperBase]
- classmethod _get_meta_cat_and_tokenizer_paths(folder_path)
- Parameters:
folder_path (str)
- Return type:
tuple[str, str]
- save(folder_path)
- Parameters:
folder_path (str)
- Return type:
None
- _init_data_paths()
- property include_in_output: bool
- Return type:
bool
- get_output_key_val(ent)
- Parameters:
- Return type:
tuple[str, dict[str, MetaAnnotationValue]]
- serialise_to(folder_path)
- Parameters:
folder_path (str)
- Return type:
None
- classmethod deserialise_from(folder_path, **init_kwargs)
- Parameters:
folder_path (str)
- Return type:
- get_strategy()
- Return type:
- classmethod get_init_attrs()
- Return type:
list[str]
- classmethod ignore_attrs()
- Return type:
list[str]
- classmethod include_properties()
- Return type:
list[str]
- get_hash()
- Return type:
str
- NAME_PREFIX: str = 'addon_'
- NAME_SPLITTER: str = '.'
- is_core()
Whether the component is a core component or not.
- Returns:
bool – Whether this is a core component.
- Return type:
bool
- classmethod get_folder_name_for_addon_and_name(addon_type, name)
- Parameters:
addon_type (str)
name (str)
- Return type:
str
- get_folder_name()
- Return type:
str
- property full_name: str
Name with the component type (e.g. ner, linking, meta).
- Return type:
str
- __slots__ = ()
- _is_protocol = True
- _is_runtime_protocol = False
- classmethod __init_subclass__(*args, **kwargs)
- classmethod __class_getitem__(params)
- medcat.components.addons.meta_cat.get_meta_annotations(entity)
- Parameters:
- Return type:
dict[str, MetaAnnotationValue]
- class medcat.components.addons.meta_cat.MetaAnnotationValue
Bases:
TypedDict
dict() -> new empty dictionary
dict(mapping) -> new dictionary initialized from a mapping object’s (key, value) pairs
dict(iterable) -> new dictionary initialized as if via:
    d = {}
    for k, v in iterable:
        d[k] = v
dict(**kwargs) -> new dictionary initialized with the name=value pairs in the keyword argument list. For example: dict(one=1, two=2)
- name: str
- value: str
- confidence: float
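The three keys above can be mirrored in a local TypedDict to show how a mapping of meta-annotations might be consumed. Everything in this sketch is illustrative: `MetaAnnotationValueSketch` and `confident_values` are hypothetical names, the sample values are made up, and the outer mapping is assumed to be keyed by meta-annotation task name.

```python
from typing import TypedDict


class MetaAnnotationValueSketch(TypedDict):
    """Local mirror of the documented keys; not imported from MedCAT."""
    name: str
    value: str
    confidence: float


def confident_values(
    meta_anns: dict[str, MetaAnnotationValueSketch], threshold: float
) -> dict[str, str]:
    """Keep the predicted value of every meta-annotation at or above the threshold."""
    return {
        key: ann["value"]
        for key, ann in meta_anns.items()
        if ann["confidence"] >= threshold
    }


# Made-up meta-annotations, keyed by an assumed task name.
meta_anns: dict[str, MetaAnnotationValueSketch] = {
    "Presence": {"name": "Presence", "value": "True", "confidence": 0.97},
    "Subject": {"name": "Subject", "value": "Patient", "confidence": 0.55},
}
kept = confident_values(meta_anns, threshold=0.9)
```

A TypedDict is an ordinary dict at runtime, so values are read with subscript access such as ann["value"] rather than attribute access.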
- __contains__()
True if the dictionary has the specified key, else False.
- __delattr__()
Implement delattr(self, name).
- __delitem__()
Delete self[key].
- __dir__()
Default dir() implementation.
- __eq__()
Return self==value.
- __format__()
Default object formatter.
- __ge__()
Return self>=value.
- __getattribute__()
Return getattr(self, name).
- __getitem__()
x.__getitem__(y) <==> x[y]
- __gt__()
Return self>value.
- __init__()
Initialize self. See help(type(self)) for accurate signature.
- __ior__()
Return self|=value.
- __iter__()
Implement iter(self).
- __le__()
Return self<=value.
- __len__()
Return len(self).
- __lt__()
Return self<value.
- __ne__()
Return self!=value.
- __new__()
Create and return a new object. See help(type) for accurate signature.
- __or__()
Return self|value.
- __reduce__()
Helper for pickle.
- __reduce_ex__()
Helper for pickle.
- __repr__()
Return repr(self).
- __reversed__()
Return a reverse iterator over the dict keys.
- __ror__()
Return value|self.
- __setattr__()
Implement setattr(self, name, value).
- __setitem__()
Set self[key] to value.
- __sizeof__()
D.__sizeof__() -> size of D in memory, in bytes
- __str__()
Return str(self).
- __subclasshook__()
Abstract classes can override this to customize issubclass().
This is invoked early on by abc.ABCMeta.__subclasscheck__(). It should return True, False or NotImplemented. If it returns NotImplemented, the normal algorithm is used. Otherwise, it overrides the normal algorithm (and the outcome is cached).
- clear()
D.clear() -> None. Remove all items from D.
- copy()
D.copy() -> a shallow copy of D
- get()
Return the value for key if key is in the dictionary, else default.
- items()
D.items() -> a set-like object providing a view on D’s items
- keys()
D.keys() -> a set-like object providing a view on D’s keys
- pop()
D.pop(k[,d]) -> v, remove specified key and return the corresponding value.
If the key is not found, return the default if given; otherwise, raise a KeyError.
- popitem()
Remove and return a (key, value) pair as a 2-tuple.
Pairs are returned in LIFO (last-in, first-out) order. Raises KeyError if the dict is empty.
- setdefault()
Insert key with a value of default if key is not in the dictionary.
Return the value for key if key is in the dictionary, else default.
- update()
D.update([E, ]**F) -> None. Update D from dict/iterable E and F.
If E is present and has a .keys() method, then does: for k in E: D[k] = E[k]
If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v
In either case, this is followed by: for k in F: D[k] = F[k]
- values()
D.values() -> an object providing a view on D’s values
- medcat.components.addons.meta_cat.__all__ = ['MetaCAT', 'MetaCATAddon', 'get_meta_annotations', 'MetaAnnotationValue']
- medcat.components.addons.meta_cat._EXTRA_NAME = 'meta-cat'