medcat.components.ner.trf.tokenizer

Attributes

logger

Classes

TransformersTokenizer

Module Contents

medcat.components.ner.trf.tokenizer.logger

class medcat.components.ner.trf.tokenizer.TransformersTokenizer(hf_tokenizer=None, max_len=512, id2type=None, cui2name=None)

Args:

hf_tokenizer: Must be able to return token offsets.
max_len: Max sequence length; longer inputs are split into multiple examples.
id2type: Can be ignored in most cases. Should be a map from token to ‘start’ or ‘sub’, indicating whether the token is the start of a word (or a full word) or a subword. For BERT, ‘start’ is every token that does not begin with ##.
cui2name: Map from CUI to full name for labels.

Parameters:

hf_tokenizer (Optional[transformers.tokenization_utils_base.PreTrainedTokenizerBase])
max_len (int)
id2type (Optional[Dict])
cui2name (Optional[Dict])
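The ‘start’/‘sub’ distinction described for id2type can be illustrated with a minimal sketch. It uses a toy WordPiece-style vocabulary rather than a real BERT vocabulary, and it assumes the map is keyed by token id (as the name id2type suggests). The helper build_id2type is hypothetical and not part of MedCAT.

```python
# Toy WordPiece-style vocabulary (hypothetical, not a real BERT vocab).
toy_vocab = {"[CLS]": 0, "heart": 1, "##burn": 2, "pain": 3, "##ful": 4}


def build_id2type(vocab):
    """Map each token id to 'sub' (a '##' continuation piece) or 'start'.

    Assumption: keys are token ids; the library may build its map differently.
    """
    return {
        token_id: "sub" if token.startswith("##") else "start"
        for token, token_id in vocab.items()
    }


id2type = build_id2type(toy_vocab)
for token, token_id in toy_vocab.items():
    print(token, id2type[token_id])
```

For tokenizers that mark word boundaries differently (for example with a leading space marker instead of ##), the prefix test would need to change accordingly.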

__init__(hf_tokenizer=None, max_len=512, id2type=None, cui2name=None)

Parameters:

hf_tokenizer (Optional[transformers.tokenization_utils_base.PreTrainedTokenizerBase])
max_len (int)
id2type (Optional[Dict])
cui2name (Optional[Dict])

Return type:

None

hf_tokenizer = None

max_len = 512

label_map

id2type = None

cui2name = None

calculate_label_map(dataset)

Return type: None

encode(examples, ignore_subwords=False)

Used with the Hugging Face datasets map function to convert a medcat_ner dataset into the form required for NER with BERT. Text segments longer than max_len are split into multiple sequences (chunking).

Parameters:

examples (Dict) – Stream of examples.
ignore_subwords (bool) – If set to True, subwords of any token get the special label X.

Returns:

Dict – The same dict, modified.

Return type:

Dict
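The two behaviours described for encode, chunking to max_len and labelling subwords with X, can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: chunk_example and mask_subwords are hypothetical helpers, the chunks are simple non-overlapping slices, and the per-token ‘start’/‘sub’ types are assumed to be already known.

```python
def chunk_example(input_ids, labels, max_len=512):
    """Split one tokenized example into consecutive chunks of at most max_len tokens.

    Assumption: chunks do not overlap; the real encode may split differently.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [
        (input_ids[i:i + max_len], labels[i:i + max_len])
        for i in range(0, len(input_ids), max_len)
    ]


def mask_subwords(token_types, labels, subword_label="X"):
    """Replace the label of every 'sub' token with the special label."""
    return [
        subword_label if token_type == "sub" else label
        for token_type, label in zip(token_types, labels)
    ]


# Hypothetical token ids and labels for a five-token example.
ids = [101, 102, 103, 104, 105]
labels = ["O", "C1", "C1", "O", "O"]
types = ["start", "start", "sub", "start", "start"]

masked = mask_subwords(types, labels)
chunks = chunk_example(ids, masked, max_len=2)
print(chunks)
```

Masking before chunking keeps ids and labels aligned, since both lists are sliced at the same positions.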

save(path)

Parameters: path (str)
Return type: None

ensure_tokenizer()

Return type: transformers.tokenization_utils_base.PreTrainedTokenizerBase

classmethod load(path)

Parameters: path (str)
Return type: TransformersTokenizer