medcat.components.addons.meta_cat.data_utils
============================================

.. py:module:: medcat.components.addons.meta_cat.data_utils


Attributes
----------

.. autoapisummary::

   medcat.components.addons.meta_cat.data_utils.logger


Classes
-------

.. autoapisummary::

   medcat.components.addons.meta_cat.data_utils.TokenizerWrapperBase


Functions
---------

.. autoapisummary::

   medcat.components.addons.meta_cat.data_utils.prepare_from_json
   medcat.components.addons.meta_cat.data_utils.prepare_for_oversampled_data
   medcat.components.addons.meta_cat.data_utils.encode_category_values


Module Contents
---------------

.. py:class:: TokenizerWrapperBase(hf_tokenizer = None)

   Bases: :py:obj:`abc.ABC`


   Helper class that provides a standard way to create an ABC using
   inheritance.


   .. py:attribute:: name
      :type:  str


   .. py:method:: __init__(hf_tokenizer = None)


   .. py:attribute:: hf_tokenizers
      :value: None


   .. py:method:: __call__(text: str) -> dict
                  __call__(text: list[str]) -> list[dict]


   .. py:method:: save(dir_path)
      :abstractmethod:


   .. py:method:: load(dir_path, model_variant = '', **kwargs)
      :classmethod:
      :abstractmethod:


   .. py:method:: get_size()
      :abstractmethod:


   .. py:method:: token_to_id(token)
      :abstractmethod:


   .. py:method:: get_pad_id()
      :abstractmethod:


   .. py:method:: ensure_tokenizer()


   .. py:attribute:: __slots__
      :value: ()
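
The abstract methods listed for `TokenizerWrapperBase` suggest the shape a concrete wrapper takes. The block below is a minimal sketch only. It uses a local stand-in ABC instead of importing MedCAT, and the names `TokenizerWrapperSketch` and `WhitespaceTokenizer`, the `vocab.json` file name, and the `tokens` / `input_ids` keys of the returned dict are all assumptions made for illustration, not part of the documented API.

```python
import json
import os
from abc import ABC, abstractmethod


class TokenizerWrapperSketch(ABC):
    """Local stand-in that mirrors the abstract method names listed above."""

    @abstractmethod
    def save(self, dir_path): ...

    @classmethod
    @abstractmethod
    def load(cls, dir_path, model_variant="", **kwargs): ...

    @abstractmethod
    def get_size(self): ...

    @abstractmethod
    def token_to_id(self, token): ...

    @abstractmethod
    def get_pad_id(self): ...


class WhitespaceTokenizer(TokenizerWrapperSketch):
    """Hypothetical wrapper that splits on whitespace and looks ids up in a dict."""

    name = "whitespace-sketch"

    def __init__(self, vocab=None):
        # Assumed special tokens; a real tokenizer defines its own.
        self.vocab = dict(vocab or {"<pad>": 0, "<unk>": 1})

    def __call__(self, text):
        # Mirror the overloads above: str -> dict, list[str] -> list[dict].
        if isinstance(text, list):
            return [self(item) for item in text]
        tokens = text.split()
        return {"tokens": tokens,
                "input_ids": [self.token_to_id(t) for t in tokens]}

    def save(self, dir_path):
        os.makedirs(dir_path, exist_ok=True)
        with open(os.path.join(dir_path, "vocab.json"), "w") as f:
            json.dump(self.vocab, f)

    @classmethod
    def load(cls, dir_path, model_variant="", **kwargs):
        # model_variant is accepted for signature parity but unused here.
        with open(os.path.join(dir_path, "vocab.json")) as f:
            return cls(json.load(f))

    def get_size(self):
        return len(self.vocab)

    def token_to_id(self, token):
        # Unknown tokens fall back to the <unk> id.
        return self.vocab.get(token, self.vocab["<unk>"])

    def get_pad_id(self):
        return self.vocab["<pad>"]
```

A subclass that omits any of the five abstract methods cannot be instantiated, which is the main guarantee an ABC-based wrapper gives the calling code.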


.. py:data:: logger

.. py:function:: prepare_from_json(data, cntx_left, cntx_right, tokenizer, cui_filter = None, replace_center = None, prerequisites = {}, lowercase = True)

   Convert the data from JSON format into a CSV-like format for
   training. This function is not very efficient (the one that works with
   documents as part of the meta_cat.pipe method is much better).
   If your dataset has more than 1M documents, consider rewriting this
   function, although it would be unusual to have more than 1M manually
   annotated documents.

   :param data: Loaded output of MedCATtrainer. If we have a `my_export.json`
                from MedCATtrainer, then data = json.load(<my_export>).
   :type data: dict
   :param cntx_left: Size of context to get from the left of the concept
   :type cntx_left: int
   :param cntx_right: Size of context to get from the right of the concept
   :type cntx_right: int
   :param tokenizer: Tokenizer used to split text into tokens for the meta models
                     (LSTM, BERT, etc.).
   :type tokenizer: TokenizerWrapperBase
   :param replace_center: If not None, the center word (concept) is replaced with
                          this value.
   :type replace_center: Optional[str]
   :param prerequisites: A map of prerequisites. For example, suppose the data has two
                         meta-annotations (experiencer, negation) and we want to create
                         a dataset for `negation` using only those cases where
                         `experiencer=patient`. The prerequisites would then be
                         `{'Experiencer': 'Patient'}`. Note that the case has to
                         match whatever is in the data. Defaults to `{}`.
   :type prerequisites: dict
   :param lowercase: Whether to lowercase the text before tokenization.
                     Defaults to True.
   :type lowercase: bool
   :param cui_filter: CUI filter if set. Defaults to None.
   :type cui_filter: Optional[set]

   :Returns: **out_data** (*dict*) --

             Example: {'category_name': [('<category_value>', '<[tokens]>',
                         '<center_token>'), ...], ...}
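
The `prerequisites`, `cntx_left` and `cntx_right` parameters described above can be illustrated with a small stand-alone sketch. It is not the MedCAT implementation: the helper names `meets_prerequisites` and `context_window`, and the flat `{name: value}` shape used for the meta-annotations, are assumptions made only for this example.

```python
def meets_prerequisites(meta_values, prerequisites):
    """Return True if every prerequisite matches exactly (case-sensitive)."""
    return all(meta_values.get(name) == value
               for name, value in prerequisites.items())


def context_window(tokens, center, cntx_left, cntx_right):
    """Keep up to cntx_left tokens before and cntx_right tokens after center.

    Returns the window and the position of the center token inside it.
    """
    start = max(0, center - cntx_left)
    end = min(len(tokens), center + 1 + cntx_right)
    return tokens[start:end], center - start


tokens = ["patient", "denies", "chest", "pain", "on", "exertion"]
window, center_pos = context_window(tokens, center=3, cntx_left=2, cntx_right=1)
print(window, center_pos)  # ['denies', 'chest', 'pain', 'on'] 2

# Hypothetical meta-annotation values for one annotated concept.
meta = {"Experiencer": "Patient", "Negation": "Yes"}
print(meets_prerequisites(meta, {"Experiencer": "Patient"}))  # True
print(meets_prerequisites(meta, {"Experiencer": "patient"}))  # False
```

The last call shows why the case of the prerequisite values matters: an exact comparison treats `patient` and `Patient` as different values.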


.. py:function:: prepare_for_oversampled_data(data, tokenizer)

   Convert the data from JSON format into a CSV-like format for
   training. This function is not very efficient (the one that works with
   documents as part of the meta_cat.pipe method is much better).
   If your dataset has more than 1M documents, consider rewriting this
   function, although it would be unusual to have more than 1M manually
   annotated documents.

   :param data: Oversampled data, expected in the following format:
                [[['text', 'of', 'the', 'document'], [index of medical entity], "label"],
                 [['text', 'of', 'the', 'document'], [index of medical entity], "label"]]
   :type data: list
   :param tokenizer: Tokenizer used to split text into tokens for the meta models
                     (LSTM, BERT, etc.).
   :type tokenizer: TokenizerWrapperBase

   :Returns: **data_sampled** (*list*) -- The processed data, in a format that can be merged with the
             output from prepare_from_json:
             [[<[tokens]>, [index of medical entity], "label"],
             [<[tokens]>, [index of medical entity], "label"]]
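
The input and output shapes described for `prepare_for_oversampled_data` can be sketched as follows. This is an illustration under assumptions, not the library code: the helper `encode_oversampled`, the toy vocabulary, and the choice to represent the processed tokens as integer ids are all hypothetical.

```python
def encode_oversampled(data, token_to_id):
    """Map each sample's token strings to ids, keeping index and label as-is."""
    encoded = []
    for tokens, entity_index, label in data:
        ids = [token_to_id(tok) for tok in tokens]
        encoded.append([ids, list(entity_index), label])
    return encoded


# Toy vocabulary standing in for a real tokenizer's token_to_id method.
vocab = {"<unk>": 0, "no": 1, "chest": 2, "pain": 3}


def lookup(token):
    return vocab.get(token, vocab["<unk>"])


data = [
    [["no", "chest", "pain"], [2], "Negated"],
    [["chest", "pain", "today"], [1], "Affirmed"],
]
print(encode_oversampled(data, lookup))
# [[[1, 2, 3], [2], 'Negated'], [[2, 3, 0], [1], 'Affirmed']]
```

Each sample keeps the same three-element layout on the way out, which is what allows the result to be concatenated with other prepared samples.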


.. py:function:: encode_category_values(data, existing_category_value2id = None, category_undersample = None, alternative_class_names = [])

   Converts the category values in the data output by
   `prepare_from_json` into integer values.

   :param data: Output of `prepare_from_json`.
   :type data: dict
   :param existing_category_value2id: Map from category_value to id (old/existing).
   :type existing_category_value2id: Optional[dict]
   :param category_undersample: Name of the class that should be used to undersample the data
                                (for two-phase learning).
   :param alternative_class_names: A list of lists of strings, where each list contains variations
                                   of a class name. Usually read from the config at
                                   `config.general.alternative_class_names`.
   :type alternative_class_names: list[list[str]]

   :Returns: * **dict** -- New data with integers in place of strings for category values.
             * **dict** -- New undersampled data (for two-phase learning) with integers
               in place of strings for category values.
             * **dict** -- Map from category value to ID for all categories in the data.

   :raises Exception: If `existing_category_value2id` is pre-defined and its labels do
       not match the labels found in the data.
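
The label encoding and the mismatch check described above can be sketched on a plain list of label strings. This is a simplified illustration, not the MedCAT code: the helper name `encode_labels`, the sorted id assignment, and the use of `ValueError` (a subclass of `Exception`) are assumptions, and undersampling and alternative class names are left out.

```python
def encode_labels(labels, existing_value2id=None):
    """Replace string labels with integer ids and return the mapping used."""
    found = set(labels)
    if existing_value2id:
        # A pre-defined mapping must cover exactly the labels in the data.
        if set(existing_value2id) != found:
            raise ValueError(
                f"Mapping labels {sorted(existing_value2id)} do not match "
                f"labels found in the data {sorted(found)}"
            )
        value2id = dict(existing_value2id)
    else:
        # Sorted order keeps the ids deterministic across runs.
        value2id = {value: idx for idx, value in enumerate(sorted(found))}
    return [value2id[value] for value in labels], value2id


ids, mapping = encode_labels(["Yes", "No", "Yes"])
print(ids, mapping)  # [1, 0, 1] {'No': 0, 'Yes': 1}
```

Passing an existing mapping reuses its ids, so a model trained earlier keeps the same meaning for each output index.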