medcat.components.ner.trf.tokenizer
Attributes
Classes
Module Contents
- medcat.components.ner.trf.tokenizer.logger
- class medcat.components.ner.trf.tokenizer.TransformersTokenizer(hf_tokenizer=None, max_len=512, id2type=None, cui2name=None)
Args:
- hf_tokenizer:
Must be able to return token offsets.
- max_len:
Maximum sequence length. Longer inputs are split into multiple examples.
- id2type:
Can be ignored in most cases. Should be a map from token to ‘start’ or ‘sub’, indicating whether the token is a subword or the start of a word (or a full word). For BERT, ‘start’ is everything that does not begin with ##.
- cui2name:
Map from CUI to full name for labels.
- Parameters:
hf_tokenizer (Optional[transformers.tokenization_utils_base.PreTrainedTokenizerBase])
max_len (int)
id2type (Optional[Dict])
cui2name (Optional[Dict])
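The id2type description above can be illustrated with a minimal sketch. It assumes a BERT-style WordPiece vocabulary and keys the map by token string, as the description says; `build_id2type` and the sample vocabulary are hypothetical and not part of MedCAT.

```python
from typing import Dict, Iterable


def build_id2type(tokens: Iterable[str]) -> Dict[str, str]:
    """Hypothetical helper: label each token as 'start' or 'sub'.

    Assumes WordPiece conventions, where continuation pieces begin with '##'.
    """
    return {tok: ("sub" if tok.startswith("##") else "start") for tok in tokens}


# Illustrative vocabulary only; a real one would come from the tokenizer.
vocab = ["heart", "##burn", "pain", "##ful"]
id2type = build_id2type(vocab)
print(id2type)
# {'heart': 'start', '##burn': 'sub', 'pain': 'start', '##ful': 'sub'}
```

Since the argument can be ignored in most cases, a map like this is probably only needed for tokenizers whose subword convention differs from the default.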
- __init__(hf_tokenizer=None, max_len=512, id2type=None, cui2name=None)
- Parameters:
hf_tokenizer (Optional[transformers.tokenization_utils_base.PreTrainedTokenizerBase])
max_len (int)
id2type (Optional[Dict])
cui2name (Optional[Dict])
- Return type:
None
- hf_tokenizer = None
- max_len = 512
- label_map
- id2type = None
- cui2name = None
- calculate_label_map(dataset)
- Return type:
None
- encode(examples, ignore_subwords=False)
Used with the Hugging Face datasets map function to convert a medcat_ner dataset into the form required for NER with BERT. Long text segments are split into max_len sequences (chunking).
- Parameters:
examples (Dict) – Stream of examples.
ignore_subwords (bool) – If set to True, subwords of any token will get the special label X.
- Returns:
Dict – The same dict, modified.
- Return type:
Dict
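The two behaviours described for encode, chunking to max_len and labelling subwords with X, can be sketched in plain Python. This is not MedCAT's implementation: `chunk` and `mask_subword_labels` are hypothetical helpers, the chunks are assumed to be non-overlapping, and WordPiece-style `##` prefixes are assumed for subwords.

```python
from typing import List


def chunk(seq: List[int], max_len: int) -> List[List[int]]:
    """Split a sequence into consecutive pieces of at most max_len items.

    Assumption: chunks do not overlap and special tokens are not handled.
    """
    return [seq[i:i + max_len] for i in range(0, len(seq), max_len)]


def mask_subword_labels(tokens: List[str], labels: List[str],
                        ignore_subwords: bool = False) -> List[str]:
    """Replace the label of every '##' subword with 'X' when requested."""
    if not ignore_subwords:
        return list(labels)
    return ["X" if tok.startswith("##") else lab
            for tok, lab in zip(tokens, labels)]


chunks = chunk(list(range(7)), max_len=3)
print(chunks)  # [[0, 1, 2], [3, 4, 5], [6]]

# "C1" is a placeholder label, not a real CUI.
masked = mask_subword_labels(["heart", "##burn", "pain"],
                             ["C1", "C1", "O"],
                             ignore_subwords=True)
print(masked)  # ['C1', 'X', 'O']
```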
- save(path)
- Parameters:
path (str)
- Return type:
None
- ensure_tokenizer()
- Return type:
transformers.tokenization_utils_base.PreTrainedTokenizerBase
- classmethod load(path)
- Parameters:
path (str)
- Return type: