medcat.components.addons.meta_cat.data_utils

Attributes

logger

Classes

TokenizerWrapperBase

Helper class that provides a standard way to create an ABC using inheritance.

Functions

prepare_from_json(data, cntx_left, cntx_right, tokenizer)

Convert the data from a JSON format into a CSV-like format for training.

prepare_for_oversampled_data(data, tokenizer)

Convert the data from a JSON format into a CSV-like format for training.

encode_category_values(data[, ...])

Converts the category values in the data outputted by prepare_from_json into integer values.

Module Contents

class medcat.components.addons.meta_cat.data_utils.TokenizerWrapperBase(hf_tokenizer=None)

Bases: abc.ABC

Helper class that provides a standard way to create an ABC using inheritance.

Parameters:

hf_tokenizer (Optional[tokenizers.Tokenizer])

name: str
__init__(hf_tokenizer=None)
Parameters:

hf_tokenizer (Optional[tokenizers.Tokenizer])

Return type:

None

hf_tokenizers = None
__call__(text: str) → dict
__call__(text: list[str]) → list[dict]
abstract save(dir_path)
Parameters:

dir_path (str)

Return type:

None

classmethod load(dir_path, model_variant='', **kwargs)
Abstractmethod:

Parameters:
  • dir_path (str)

  • model_variant (Optional[str])

Return type:

tokenizers.Tokenizer

abstract get_size()
Return type:

int

abstract token_to_id(token)
Parameters:

token (str)

Return type:

Union[int, list[int]]

abstract get_pad_id()
Return type:

Union[Optional[int], list[int]]

ensure_tokenizer()
Return type:

tokenizers.Tokenizer

__slots__ = ()
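The method list above describes an interface rather than a concrete tokenizer. As a rough illustration only, the sketch below defines a standalone abstract base with a few of the same method names and a toy whitespace implementation. It does not import or subclass the real TokenizerWrapperBase; the names ToyWrapperBase and WhitespaceWrapper, the special tokens, and the keys of the returned dict are all assumptions made for this example.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class ToyWrapperBase(ABC):
    """Hypothetical stand-in that mirrors some documented method names."""

    @abstractmethod
    def get_size(self) -> int:
        """Return the vocabulary size."""

    @abstractmethod
    def token_to_id(self, token: str) -> int:
        """Map a single token to its integer id."""

    @abstractmethod
    def get_pad_id(self) -> int:
        """Return the id used for padding."""


class WhitespaceWrapper(ToyWrapperBase):
    """Toy implementation: splits on whitespace and looks ids up in a dict."""

    def __init__(self, vocab: list[str]) -> None:
        # Ids 0 and 1 are reserved for padding and unknown tokens (assumption).
        self.vocab = {"<pad>": 0, "<unk>": 1}
        for token in vocab:
            self.vocab.setdefault(token, len(self.vocab))

    def __call__(self, text: str) -> dict:
        tokens = text.split()
        return {"tokens": tokens, "input_ids": [self.token_to_id(t) for t in tokens]}

    def get_size(self) -> int:
        return len(self.vocab)

    def token_to_id(self, token: str) -> int:
        # Unknown tokens fall back to the "<unk>" id.
        return self.vocab.get(token, self.vocab["<unk>"])

    def get_pad_id(self) -> int:
        return self.vocab["<pad>"]


wrapper = WhitespaceWrapper(["no", "fever"])
encoded = wrapper("no fever today")
print(encoded["input_ids"])
```

Because every abstract method must be overridden, instantiating ToyWrapperBase directly raises TypeError, which is the behaviour an ABC-based wrapper relies on to enforce the interface.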
medcat.components.addons.meta_cat.data_utils.logger
medcat.components.addons.meta_cat.data_utils.prepare_from_json(data, cntx_left, cntx_right, tokenizer, cui_filter=None, replace_center=None, prerequisites={}, lowercase=True)

Convert the data from a JSON format into a CSV-like format for training. This function is not very efficient; the one that works with documents as part of the meta_cat.pipe method is much better. If your dataset has more than 1M documents, consider rewriting this function, although it would be unusual to have more than 1M manually annotated documents.

Parameters:
  • data (dict) – Loaded output of MedCATtrainer. If we have a my_export.json from MedCATtrainer, then data = json.load(<my_export>).

  • cntx_left (int) – Size of context to get from the left of the concept

  • cntx_right (int) – Size of context to get from the right of the concept

  • tokenizer (TokenizerWrapperBase) – The tokenizer used to split text into tokens for the meta model (LSTM, BERT, etc.).

  • replace_center (Optional[str]) – If not None the center word (concept) will be replaced with whatever this is.

  • prerequisites (dict) –

    A map of prerequisites. For example, suppose the data has two meta-annotations (experiencer, negation) and we want to create a dataset for negation, but only for those cases where experiencer=patient. The prerequisites would then be {'Experiencer': 'Patient'}. Take care that the case matches whatever is in the data. Defaults to {}.

  • lowercase (bool) – Should the text be lowercased before tokenization. Defaults to True.

  • cui_filter (Optional[set]) – CUI filter if set. Defaults to None.

Returns:

out_data (dict) –

Example: {'category_name': [('<category_value>', '<[tokens]>', '<center_token>'), …], …}

Return type:

dict
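The prerequisites parameter is easiest to read as a case-sensitive filter over meta-annotations. The sketch below shows that idea on a deliberately simplified annotation structure. It is not the library's implementation, and the helper passes_prerequisites and the flat meta_anns layout are assumptions made for illustration; a real MedCATtrainer export is nested differently.

```python
from __future__ import annotations


def passes_prerequisites(meta_anns: dict[str, str], prerequisites: dict[str, str]) -> bool:
    """Return True when every prerequisite name/value pair matches exactly."""
    return all(meta_anns.get(name) == value for name, value in prerequisites.items())


# Simplified, hypothetical annotations: meta-annotation name -> value.
annotations = [
    {"value": "fever", "meta_anns": {"Experiencer": "Patient", "Negation": "No"}},
    {"value": "cancer", "meta_anns": {"Experiencer": "Family", "Negation": "No"}},
    # Lower-case "patient" does not match "Patient": the comparison is case-sensitive.
    {"value": "cough", "meta_anns": {"Experiencer": "patient", "Negation": "Yes"}},
]

prerequisites = {"Experiencer": "Patient"}
kept = [a["value"] for a in annotations if passes_prerequisites(a["meta_anns"], prerequisites)]
print(kept)
```

With an empty prerequisites dict, as in the default, every annotation passes the filter.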

medcat.components.addons.meta_cat.data_utils.prepare_for_oversampled_data(data, tokenizer)

Convert the data from a JSON format into a CSV-like format for training. This function is not very efficient; the one that works with documents as part of the meta_cat.pipe method is much better. If your dataset has more than 1M documents, consider rewriting this function, although it would be unusual to have more than 1M manually annotated documents.

Parameters:
  • data (list) –

    Oversampled data expected in the following format: [[['text', 'of', 'the', 'document'], [index of medical entity], "label"], [['text', 'of', 'the', 'document'], [index of medical entity], "label"]]

  • tokenizer (TokenizerWrapperBase) – The tokenizer used to split text into tokens for the meta model (LSTM, BERT, etc.).

Returns:

data_sampled (list) – The processed data in a format that can be merged with the output from prepare_from_json: [[<[tokens]>, [index of medical entity], "label"], [<[tokens]>, [index of medical entity], "label"]]

Return type:

list
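The nested list format described above is easy to get wrong by one bracket level. The sketch below builds a small input in that shape and checks each sample's structure before it would be passed on. The helper is_valid_sample is hypothetical and not part of the module, and the check covers only the three-element layout stated in the docstring (tokens, entity index list, label).

```python
from __future__ import annotations


def is_valid_sample(sample: object) -> bool:
    """Check that a sample looks like [list of str tokens, list of int indices, str label]."""
    if not isinstance(sample, (list, tuple)) or len(sample) != 3:
        return False
    tokens, entity_index, label = sample
    return (
        isinstance(tokens, list)
        and all(isinstance(t, str) for t in tokens)
        and isinstance(entity_index, list)
        and all(isinstance(i, int) for i in entity_index)
        and isinstance(label, str)
    )


oversampled = [
    [["patient", "denies", "fever"], [2], "Negated"],
    [["patient", "has", "cough"], [2], "Affirmed"],
]

# A common mistake: forgetting the brackets around a sample flattens the list.
flattened = [["patient", "denies", "fever"], [2], "Negated"]

print(all(is_valid_sample(s) for s in oversampled))
print(all(is_valid_sample(s) for s in flattened))
```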

medcat.components.addons.meta_cat.data_utils.encode_category_values(data, existing_category_value2id=None, category_undersample=None, alternative_class_names=[])

Converts the category values in the data outputted by prepare_from_json into integer values.

Parameters:
  • data (dict) – Output of prepare_from_json.

  • existing_category_value2id (Optional[dict]) – Map from category_value to id (old/existing).

  • category_undersample – Name of the class that should be used to undersample the data (for two-phase learning).

  • alternative_class_names (list[list[str]]) – A list of lists of strings, where each list contains variations of a class name. Usually read from the config at config.general.alternative_class_names.

Returns:
  • dict – New data with integers in place of strings for category values.

  • dict – New undersampled data (for two-phase learning) with integers in place of strings for category values.

  • dict – Map from category value to ID for all categories in the data.

Raises:

Exception – If existing_category_value2id is pre-defined and its labels do not match the labels found in the data.

Return type:

tuple
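The core idea of this function, replacing string category values with integer ids while respecting an existing mapping, can be sketched in a few lines. The code below is a simplified illustration and not the library's implementation: toy_encode is a hypothetical helper, it assumes samples of the form (tokens, entity index, label), it assigns ids in order of first appearance, and it skips undersampling and alternative class names entirely, so it returns two values rather than three.

```python
from __future__ import annotations


def toy_encode(samples, existing_value2id=None):
    """Replace string labels with integer ids; return (encoded samples, mapping)."""
    labels = [label for _, _, label in samples]
    if existing_value2id is None:
        # Assign ids in order of first appearance (an assumption of this sketch).
        value2id = {}
        for label in labels:
            value2id.setdefault(label, len(value2id))
    else:
        value2id = dict(existing_value2id)
        # Mirror the documented failure mode: a pre-defined mapping must match the data.
        if set(labels) != set(value2id):
            raise ValueError("Labels in the data do not match the existing mapping")
    encoded = [(tokens, index, value2id[label]) for tokens, index, label in samples]
    return encoded, value2id


samples = [
    (["no", "fever"], [1], "Negated"),
    (["has", "cough"], [1], "Affirmed"),
    (["no", "cough"], [1], "Negated"),
]

encoded, value2id = toy_encode(samples)
print(value2id)
print([label_id for _, _, label_id in encoded])
```

Passing a mapping whose labels differ from those in the data, for example {"Yes": 0, "No": 1}, makes the sketch raise, which corresponds to the exception described in the Raises section.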