medcat.components.ner.trf.tokenizer
===================================

.. py:module:: medcat.components.ner.trf.tokenizer


Attributes
----------

.. autoapisummary::

   medcat.components.ner.trf.tokenizer.logger


Classes
-------

.. autoapisummary::

   medcat.components.ner.trf.tokenizer.TransformersTokenizer


Module Contents
---------------

.. py:data:: logger

.. py:class:: TransformersTokenizer(hf_tokenizer = None, max_len = 512, id2type = None, cui2name = None)

   Args:
       hf_tokenizer:
           Must be able to return token offsets.
       max_len:
           Maximum sequence length. Longer inputs are split into multiple examples.
       id2type:
           Can be ignored in most cases. Should be a map from token to 'start' or 'sub',
           indicating whether the token is a subword or the start of a word (or a full word).
           For BERT, 'start' is every token that does not begin with ``##``.
       cui2name:
           Map from CUI to full name for labels.


   .. py:method:: __init__(hf_tokenizer = None, max_len = 512, id2type = None, cui2name = None)


   .. py:attribute:: hf_tokenizer
      :value: None


   .. py:attribute:: max_len
      :value: 512


   .. py:attribute:: label_map


   .. py:attribute:: id2type
      :value: None


   .. py:attribute:: cui2name
      :value: None


   .. py:method:: calculate_label_map(dataset)


   .. py:method:: encode(examples, ignore_subwords = False)

      Used with the Hugging Face datasets map function to convert a medcat_ner dataset into
      the form required for NER with BERT. Long text segments are split into
      max_len sequences (chunking).

      :param examples: Stream of examples.
      :type examples: Dict
      :param ignore_subwords: If set to `True`, subwords of any token get the special label `X`.
      :type ignore_subwords: bool

      :Returns: **Dict** -- The same dict, modified.


   .. py:method:: save(path)


   .. py:method:: ensure_tokenizer()


   .. py:method:: load(path)
      :classmethod:
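
The interaction between ``id2type``, ``ignore_subwords`` and ``max_len`` may be easier to follow with a small standalone sketch. The code below is not MedCAT's implementation and does not import MedCAT. The helpers ``build_id2type``, ``label_tokens`` and ``chunk``, the toy vocabulary, and the label ``CUI_A`` are all hypothetical, and keying ``id2type`` by token id is an assumption made for this illustration.

```python
from typing import Dict, List


def build_id2type(vocab: Dict[str, int]) -> Dict[int, str]:
    """Mark BERT-style WordPiece continuations ('##...') as 'sub', all else as 'start'."""
    return {
        token_id: "sub" if token.startswith("##") else "start"
        for token, token_id in vocab.items()
    }


def label_tokens(
    token_ids: List[int],
    word_labels: List[str],
    id2type: Dict[int, str],
    ignore_subwords: bool = False,
) -> List[str]:
    """Spread word-level labels over tokens; optionally give subwords the label 'X'."""
    labels: List[str] = []
    word_idx = -1
    for token_id in token_ids:
        if id2type[token_id] == "start":
            # A 'start' token opens the next word and takes that word's label.
            word_idx += 1
            labels.append(word_labels[word_idx])
        elif ignore_subwords:
            # Subword token, masked with the special label.
            labels.append("X")
        else:
            # Subword token, inherits the label of the word it belongs to.
            labels.append(word_labels[word_idx])
    return labels


def chunk(seq: List[str], max_len: int) -> List[List[str]]:
    """Split a sequence into consecutive pieces of at most max_len items."""
    return [seq[i:i + max_len] for i in range(0, len(seq), max_len)]


# Toy vocabulary: "failure" is tokenized as "fail" + "##ure".
vocab = {"chronic": 0, "kidney": 1, "fail": 2, "##ure": 3}
id2type = build_id2type(vocab)

token_ids = [0, 1, 2, 3]               # chronic kidney fail ##ure
word_labels = ["O", "CUI_A", "CUI_A"]  # one label per word (hypothetical labels)

labels_full = label_tokens(token_ids, word_labels, id2type)
labels_x = label_tokens(token_ids, word_labels, id2type, ignore_subwords=True)
chunks = chunk(labels_x, max_len=3)

print(labels_full)
print(labels_x)
print(chunks)
```

In this sketch a sequence longer than ``max_len`` is cut into consecutive pieces, which mirrors the idea that a single long input becomes several examples. How the library handles chunk boundaries and special tokens is not shown here.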