medcat.components.addons.meta_cat.data_utils
============================================

.. py:module:: medcat.components.addons.meta_cat.data_utils


Attributes
----------

.. autoapisummary::

   medcat.components.addons.meta_cat.data_utils.logger


Classes
-------

.. autoapisummary::

   medcat.components.addons.meta_cat.data_utils.TokenizerWrapperBase


Functions
---------

.. autoapisummary::

   medcat.components.addons.meta_cat.data_utils.prepare_from_json
   medcat.components.addons.meta_cat.data_utils.prepare_for_oversampled_data
   medcat.components.addons.meta_cat.data_utils.encode_category_values


Module Contents
---------------

.. py:class:: TokenizerWrapperBase(hf_tokenizer = None)

   Bases: :py:obj:`abc.ABC`


   Helper class that provides a standard way to create an ABC using
   inheritance.


   .. py:attribute:: name
      :type:  str


   .. py:method:: __init__(hf_tokenizer = None)


   .. py:attribute:: hf_tokenizers
      :value: None


   .. py:method:: __call__(text: str) -> dict
                  __call__(text: list[str]) -> list[dict]


   .. py:method:: save(dir_path)
      :abstractmethod:


   .. py:method:: load(dir_path, model_variant = '', **kwargs)
      :classmethod:
      :abstractmethod:


   .. py:method:: get_size()
      :abstractmethod:


   .. py:method:: token_to_id(token)
      :abstractmethod:


   .. py:method:: get_pad_id()
      :abstractmethod:


   .. py:method:: ensure_tokenizer()


   .. py:attribute:: __slots__
      :value: ()


.. py:data:: logger

.. py:function:: prepare_from_json(data, cntx_left, cntx_right, tokenizer, cui_filter = None, replace_center = None, prerequisites = {}, lowercase = True)

   Convert the data from a JSON format into a CSV-like format for training.
   This function is not very efficient (the one that works with documents as
   part of the meta_cat.pipe method is much better). If your dataset has more
   than 1M documents, consider rewriting this function, although it would be
   unusual to have more than 1M manually annotated documents.

   :param data: Loaded output of MedCATtrainer. If we have a `my_export.json`
                from MedCATtrainer, then `data` is the result of calling
                `json.load()` on that file.
   :type data: dict
   :param cntx_left: Size of the context to take from the left of the concept.
   :type cntx_left: int
   :param cntx_right: Size of the context to take from the right of the
                      concept.
   :type cntx_right: int
   :param tokenizer: Something to split text into tokens for the
                     LSTM/BERT/whatever meta models.
   :type tokenizer: TokenizerWrapperBase
   :param replace_center: If not None, the center word (concept) is replaced
                          with this value.
   :type replace_center: Optional[str]
   :param prerequisites: A map of prerequisites. For example, suppose the data
                         has two meta-annotations (experiencer, negation). To
                         create a dataset for `negation` restricted to cases
                         where `experiencer=patient`, the prerequisites would
                         be: {'Experiencer': 'Patient'}. Note that the case
                         must match whatever is in the data. Defaults to `{}`.
   :type prerequisites: dict
   :param lowercase: Whether the text should be lowercased before
                     tokenization. Defaults to True.
   :type lowercase: bool
   :param cui_filter: CUI filter, if set. Defaults to None.
   :type cui_filter: Optional[set]

   :Returns: **out_data** (*dict*) -- Example:
             {'category_name': [('', '<[tokens]>', ''), ...], ...}


.. py:function:: prepare_for_oversampled_data(data, tokenizer)

   Convert the data from a JSON format into a CSV-like format for training.
   This function is not very efficient (the one that works with documents as
   part of the meta_cat.pipe method is much better). If your dataset has more
   than 1M documents, consider rewriting this function, although it would be
   unusual to have more than 1M manually annotated documents.

   :param data: Oversampled data, expected in the following format:
                [[['text', 'of', 'the', 'document'], [index of medical entity], "label"],
                [['text', 'of', 'the', 'document'], [index of medical entity], "label"]]
   :type data: list
   :param tokenizer: Something to split text into tokens for the
                     LSTM/BERT/whatever meta models.
   :type tokenizer: TokenizerWrapperBase

   :Returns: **data_sampled** (*list*) -- The processed data, in a format that
             can be merged with the output of `prepare_from_json`:
             [[<[tokens]>, [index of medical entity], "label"],
             [<[tokens]>, [index of medical entity], "label"]]


.. py:function:: encode_category_values(data, existing_category_value2id = None, category_undersample=None, alternative_class_names = [])

   Converts the category values in the data output by `prepare_from_json`
   into integer values.

   :param data: Output of `prepare_from_json`.
   :type data: dict
   :param existing_category_value2id: Map from category_value to id
                                      (old/existing).
   :type existing_category_value2id: Optional[dict]
   :param category_undersample: Name of the class that should be used to
                                undersample the data (for 2 phase learning).
   :param alternative_class_names: A list of lists of strings, where each
                                   list contains variations of a class name.
                                   Usually read from the config at
                                   `config.general.alternative_class_names`.
   :type alternative_class_names: list[list[str]]

   :Returns: * **dict** -- New data with integers in place of strings for
               category values.
             * **dict** -- New undersampled data (for 2 phase learning) with
               integers in place of strings for category values.
             * **dict** -- Map from category value to ID for all categories
               in the data.

   :raises Exception: If `existing_category_value2id` is pre-defined and its
                      labels do not match the labels found in the data.
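
The label-encoding step described for `encode_category_values` can be illustrated with a small standalone sketch. It does not import MedCAT and is not the library's implementation. The helper name `encode_labels` is hypothetical, and the sample layout `[tokens, [center index], label]` is assumed from the `prepare_for_oversampled_data` description. Undersampling and alternative class names are left out, and where the documented function raises on mismatched labels, this sketch simply extends the map.

```python
from typing import Optional


def encode_labels(
    samples: list[list],
    existing_value2id: Optional[dict[str, int]] = None,
) -> tuple[list[list], dict[str, int]]:
    """Hypothetical helper: replace string labels with integer IDs.

    Each sample is assumed to look like [tokens, [center index], label].
    """
    # Copy the existing map so the caller's dict is not mutated.
    value2id = dict(existing_value2id or {})
    encoded = []
    for tokens, center, label in samples:
        if label not in value2id:
            # Give an unseen label the next ID above the current maximum.
            value2id[label] = max(value2id.values(), default=-1) + 1
        encoded.append([tokens, center, value2id[label]])
    return encoded, value2id


# Made-up token IDs; a real tokenizer would produce these.
samples = [
    [[101, 7, 42, 9], [2], "Patient"],
    [[101, 5, 13], [1], "Other"],
    [[101, 8, 42], [2], "Patient"],
]
encoded, value2id = encode_labels(samples)
print(value2id)                  # {'Patient': 0, 'Other': 1}
print([s[2] for s in encoded])   # [0, 1, 0]
```

Passing a previously built map as `existing_value2id` keeps the IDs stable across calls, which mirrors the purpose of `existing_category_value2id` in the documented function.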