medcat.components.ner.trf.tokenizer
===================================

.. py:module:: medcat.components.ner.trf.tokenizer


Attributes
----------

.. autoapisummary::

   medcat.components.ner.trf.tokenizer.logger


Classes
-------

.. autoapisummary::

   medcat.components.ner.trf.tokenizer.TransformersTokenizer


Module Contents
---------------

.. py:data:: logger

.. py:class:: TransformersTokenizer(hf_tokenizer = None, max_len = 512, id2type = None, cui2name = None)

   :param hf_tokenizer: Must be able to return token offsets.
   :param max_len: Maximum sequence length. Longer inputs are split into
                   multiple examples.
   :param id2type: Can be ignored in most cases. A map from token to ``'start'``
                   or ``'sub'``, indicating whether the token is the start of a
                   word (or a full word) or a subword. For BERT, ``'start'`` is
                   every token that does not begin with ``##``.
   :param cui2name: Map from CUI to full name for labels.


   .. py:method:: __init__(hf_tokenizer = None, max_len = 512, id2type = None, cui2name = None)


   .. py:attribute:: hf_tokenizer
      :value: None


   .. py:attribute:: max_len
      :value: 512


   .. py:attribute:: label_map


   .. py:attribute:: id2type
      :value: None


   .. py:attribute:: cui2name
      :value: None


   .. py:method:: calculate_label_map(dataset)


   .. py:method:: encode(examples, ignore_subwords = False)

      Intended for use with the Hugging Face ``datasets`` ``map`` function to
      convert a ``medcat_ner`` dataset into the form required for NER with BERT.
      Text segments longer than ``max_len`` are split into multiple sequences
      (chunking).

      :param examples: Stream of examples.
      :type examples: Dict
      :param ignore_subwords: If ``True``, subwords of any token get the special
                              label ``X``.
      :type ignore_subwords: bool

      :Returns: **Dict** -- The same dict, modified.


   .. py:method:: save(path)


   .. py:method:: ensure_tokenizer()


   .. py:method:: load(path)
      :classmethod:
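

The labelling and chunking behaviour described for ``encode`` can be illustrated with a small, self-contained sketch. This is not the MedCAT implementation: the toy tokenizer ``word_piece_tokens``, the helpers ``token_type`` and ``label_and_chunk``, and the label ``CUI_A`` are all hypothetical names introduced here for illustration. The sketch assumes BERT-style ``##`` subword prefixes, character offsets for each token, and a simple split of the token sequence into pieces of at most ``max_len`` tokens.

```python
from typing import Dict, List, Tuple

# Hypothetical types and helpers for illustration only; not part of MedCAT.
Entity = Tuple[int, int, str]  # (char_start, char_end, cui)


def word_piece_tokens(text: str, piece_len: int = 4) -> List[Tuple[str, Tuple[int, int]]]:
    """Toy tokenizer: split on whitespace, then cut each word into pieces of
    at most `piece_len` characters. Continuation pieces get a '##' prefix.
    Returns (token, (char_start, char_end)) pairs."""
    out = []
    pos = 0
    for word in text.split():
        start = text.index(word, pos)
        for i in range(0, len(word), piece_len):
            piece = word[i:i + piece_len]
            token = piece if i == 0 else "##" + piece
            out.append((token, (start + i, start + i + len(piece))))
        pos = start + len(word)
    return out


def token_type(token: str) -> str:
    """'start' for any token that does not begin with '##', otherwise 'sub'."""
    return "sub" if token.startswith("##") else "start"


def label_and_chunk(text: str, entities: List[Entity], max_len: int,
                    ignore_subwords: bool = False) -> List[Dict[str, List[str]]]:
    """Label each token with a CUI, 'O', or 'X', then split into chunks."""
    tokens: List[str] = []
    labels: List[str] = []
    for token, (start, end) in word_piece_tokens(text):
        if ignore_subwords and token_type(token) == "sub":
            label = "X"  # subwords get the special label
        else:
            label = "O"  # outside any entity unless an entity span covers it
            for ent_start, ent_end, cui in entities:
                if start >= ent_start and end <= ent_end:
                    label = cui
                    break
        tokens.append(token)
        labels.append(label)
    # Chunking: consecutive slices of at most max_len tokens.
    return [
        {"tokens": tokens[i:i + max_len], "labels": labels[i:i + max_len]}
        for i in range(0, len(tokens), max_len)
    ]


if __name__ == "__main__":
    text = "patient has hypertension today"
    entities = [(12, 24, "CUI_A")]  # "hypertension"; CUI_A is a made-up label
    for chunk in label_and_chunk(text, entities, max_len=3, ignore_subwords=True):
        print(chunk["tokens"], chunk["labels"])
```

With ``ignore_subwords=False`` every piece of an annotated word carries the entity label, so a downstream loss sees all subwords; with ``ignore_subwords=True`` only the first piece does, and the ``X`` positions can be masked out.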