medcat.components.ner.trf.tokenizer

Attributes

logger

Classes

TransformersTokenizer

Module Contents

medcat.components.ner.trf.tokenizer.logger
class medcat.components.ner.trf.tokenizer.TransformersTokenizer(hf_tokenizer=None, max_len=512, id2type=None, cui2name=None)

Args:

hf_tokenizer:

Hugging Face tokenizer. Must be able to return token offsets.

max_len:

Maximum sequence length. Longer inputs are split into multiple examples.

id2type:

Can be ignored in most cases. Should be a map from token to ‘start’ or ‘sub’, indicating whether the token starts a word (or is a full word) or is a subword. For BERT, ‘start’ is every token that does not begin with ##.

cui2name:

Map from CUI to full name for labels.

Parameters:
  • hf_tokenizer (Optional[transformers.tokenization_utils_base.PreTrainedTokenizerBase])

  • max_len (int)

  • id2type (Optional[Dict])

  • cui2name (Optional[Dict])
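
The id2type description above can be illustrated with a minimal sketch. It assumes a BERT-style WordPiece vocabulary in which continuation pieces begin with ##, and it keys the map by token id, an assumption suggested by the parameter name. The helper build_id2type and the toy vocabulary are hypothetical and not part of MedCAT.

```python
from typing import Dict


def build_id2type(vocab: Dict[str, int]) -> Dict[int, str]:
    """Map each token id to 'sub' (starts with ##) or 'start' (everything else).

    Hypothetical helper for illustration only; not part of MedCAT.
    """
    return {
        token_id: "sub" if token.startswith("##") else "start"
        for token, token_id in vocab.items()
    }


# Toy vocabulary, made up for this example.
toy_vocab = {"[CLS]": 0, "heart": 1, "##burn": 2, "pain": 3, "##ful": 4}
id2type = build_id2type(toy_vocab)
```

Under this rule, special tokens such as [CLS] also count as ‘start’, since they do not begin with ##.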

__init__(hf_tokenizer=None, max_len=512, id2type=None, cui2name=None)
Parameters:
  • hf_tokenizer (Optional[transformers.tokenization_utils_base.PreTrainedTokenizerBase])

  • max_len (int)

  • id2type (Optional[Dict])

  • cui2name (Optional[Dict])

Return type:

None

hf_tokenizer = None
max_len = 512
label_map
id2type = None
cui2name = None
calculate_label_map(dataset)
Return type:

None

encode(examples, ignore_subwords=False)

Used with the Hugging Face datasets map function to convert a medcat_ner dataset into the form required for NER with BERT. Text segments longer than max_len are split into multiple sequences (chunking).

Parameters:
  • examples (Dict) – Stream of examples.

  • ignore_subwords (bool) – If True, subwords of any token get the special label X.

Returns:

Dict – The same dict, modified.

Return type:

Dict
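
The two behaviours described for encode, chunking to max_len and labelling subwords with X, can be sketched in isolation. This is a simplified illustration and not the method's actual implementation: the function chunk_with_subword_labels, the field names input_ids and labels, and the string labels are assumptions, and the sketch ignores special tokens and any mapping of labels to integer ids.

```python
from typing import Dict, List


def chunk_with_subword_labels(
    token_ids: List[int],
    labels: List[str],
    id2type: Dict[int, str],
    max_len: int = 512,
    ignore_subwords: bool = False,
) -> List[Dict[str, list]]:
    """Split one tokenized example into chunks of at most max_len tokens.

    Hypothetical helper for illustration only; not part of MedCAT.
    """
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    if len(token_ids) != len(labels):
        raise ValueError("token_ids and labels must have the same length")

    if ignore_subwords:
        # Replace the label of every subword token with the special label X.
        labels = [
            "X" if id2type.get(token_id) == "sub" else label
            for token_id, label in zip(token_ids, labels)
        ]

    # Walk over the sequence in steps of max_len; the last chunk may be shorter.
    return [
        {
            "input_ids": token_ids[start:start + max_len],
            "labels": labels[start:start + max_len],
        }
        for start in range(0, len(token_ids), max_len)
    ]


# Toy data, made up for this example: ids 2 and 4 are subwords.
toy_id2type = {1: "start", 2: "sub", 3: "start", 4: "sub"}
chunks = chunk_with_subword_labels(
    token_ids=[1, 2, 3, 4, 3],
    labels=["C1", "C1", "O", "O", "O"],
    id2type=toy_id2type,
    max_len=2,
    ignore_subwords=True,
)
```

With max_len=2, the five tokens above yield three chunks, the last holding a single token.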

save(path)
Parameters:

path (str)

Return type:

None

ensure_tokenizer()
Return type:

transformers.tokenization_utils_base.PreTrainedTokenizerBase

classmethod load(path)
Parameters:

path (str)

Return type:

TransformersTokenizer