medcat.components.addons.meta_cat.mctokenizers.bpe_tokenizer

Attributes

FAKE_TOKENIZER_PATH

Classes

TokenizerWrapperBase

Helper class that provides a standard way to create an ABC using inheritance.

TokenizerWrapperBPE

Wrapper around a huggingface tokenizer so that it works with the MetaCAT models.

Module Contents

class medcat.components.addons.meta_cat.mctokenizers.bpe_tokenizer.TokenizerWrapperBase(hf_tokenizer=None)

Bases: abc.ABC

Helper class that provides a standard way to create an ABC using inheritance.

Parameters:

hf_tokenizer (Optional[tokenizers.Tokenizer])

name: str
__init__(hf_tokenizer=None)
Parameters:

hf_tokenizer (Optional[tokenizers.Tokenizer])

Return type:

None

hf_tokenizers = None
__call__(text: str) → dict
__call__(text: list[str]) → list[dict]
abstract save(dir_path)
Parameters:

dir_path (str)

Return type:

None

abstract classmethod load(dir_path, model_variant='', **kwargs)
Parameters:
  • dir_path (str)

  • model_variant (Optional[str])

Return type:

tokenizers.Tokenizer

abstract get_size()
Return type:

int

abstract token_to_id(token)
Parameters:

token (str)

Return type:

Union[int, list[int]]

abstract get_pad_id()
Return type:

Union[Optional[int], list[int]]

ensure_tokenizer()
Return type:

tokenizers.Tokenizer

__slots__ = ()
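
The abstract interface above can be illustrated with a minimal, self-contained sketch. It does not import MedCAT: `SketchTokenizerBase` and `ToyVocabTokenizer` are hypothetical stand-ins, the "tokenizer" is just a plain token-to-id dict, `save` and `load` are left out for brevity, and the error raised by `ensure_tokenizer` is an assumption about how a missing tokenizer might be reported.

```python
from abc import ABC, abstractmethod
from typing import Optional, Union


class SketchTokenizerBase(ABC):
    """Hypothetical stand-in for the abstract interface; not MedCAT code."""

    name: str

    def __init__(self, hf_tokenizer=None) -> None:
        self.hf_tokenizers = hf_tokenizer

    @abstractmethod
    def get_size(self) -> int:
        """Return the vocabulary size."""

    @abstractmethod
    def token_to_id(self, token: str) -> Union[int, list[int]]:
        """Map a token string to its id (or ids)."""

    @abstractmethod
    def get_pad_id(self) -> Union[Optional[int], list[int]]:
        """Return the id used for padding."""

    def ensure_tokenizer(self):
        # Assumed behaviour: fail loudly when no tokenizer has been set.
        if self.hf_tokenizers is None:
            raise ValueError("The tokenizer is not loaded yet")
        return self.hf_tokenizers


class ToyVocabTokenizer(SketchTokenizerBase):
    """Toy subclass whose 'tokenizer' is a plain token -> id dict."""

    name = "toy"

    def get_size(self) -> int:
        return len(self.ensure_tokenizer())

    def token_to_id(self, token: str) -> int:
        vocab = self.ensure_tokenizer()
        # Unknown tokens fall back to the <UNK> id.
        return vocab.get(token, vocab["<UNK>"])

    def get_pad_id(self) -> int:
        return self.ensure_tokenizer()["<PAD>"]


toy = ToyVocabTokenizer({"<PAD>": 0, "<UNK>": 1, "pain": 2})
```

Because the base class declares abstract methods, instantiating it directly raises `TypeError`; only a subclass that implements every abstract method can be constructed.
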
medcat.components.addons.meta_cat.mctokenizers.bpe_tokenizer.FAKE_TOKENIZER_PATH = """#
/fake-path-not-exist#/"""
class medcat.components.addons.meta_cat.mctokenizers.bpe_tokenizer.TokenizerWrapperBPE(hf_tokenizers=None)

Bases: medcat.components.addons.meta_cat.mctokenizers.tokenizers.TokenizerWrapperBase

Wrapper around a huggingface tokenizer so that it works with the MetaCAT models.

Parameters:

hf_tokenizers (Optional[tokenizers.ByteLevelBPETokenizer]) – A huggingface BBPE tokenizer.

name = 'bbpe'
__init__(hf_tokenizers=None)
Parameters:

hf_tokenizers (Optional[tokenizers.ByteLevelBPETokenizer])

Return type:

None

__call__(text: str) → dict
__call__(text: list[str]) → list[dict]

Tokenize a single text or a list of texts.

Parameters:

text (Union[str, list[str]]) – Text/texts to be tokenized.

Returns:

Union[dict, list[dict]] – A dictionary (or a list of dictionaries, one per input text) containing offset_mapping, input_ids and tokens for the input text(s).

Raises:

Exception – If the input is neither a string nor a list of strings.
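
The input/output contract described above (a `str` yields one dict, a `list[str]` yields a list of dicts, anything else raises) can be sketched without MedCAT or the `tokenizers` package. Everything here is a hypothetical illustration: `TOY_VOCAB`, `toy_encode` and `tokenize` are made-up names, and whitespace splitting stands in for the real byte-level BPE encoding, so only the shape of the result is meant to match.

```python
import re
from typing import Union

# Hypothetical toy vocabulary; a real BBPE tokenizer learns its own.
TOY_VOCAB = {"<PAD>": 0, "<UNK>": 1, "chest": 2, "pain": 3, "no": 4}


def toy_encode(text: str) -> dict:
    """Encode one text into the three keys named in the docs above."""
    tokens, offsets = [], []
    for match in re.finditer(r"\S+", text):
        tokens.append(match.group())
        offsets.append((match.start(), match.end()))
    input_ids = [TOY_VOCAB.get(tok.lower(), TOY_VOCAB["<UNK>"]) for tok in tokens]
    return {"offset_mapping": offsets, "input_ids": input_ids, "tokens": tokens}


def tokenize(text: Union[str, list[str]]) -> Union[dict, list[dict]]:
    """Dispatch on input type: one dict per text."""
    if isinstance(text, str):
        return toy_encode(text)
    if isinstance(text, list):
        return [toy_encode(t) for t in text]
    raise Exception("Unsupported input type, supported: text/list, but got: "
                    + str(type(text)))


single = tokenize("no chest pain")
batch = tokenize(["no chest pain", "fever"])
```

Each `offset_mapping` entry is a `(start, end)` character span into the original text, so `text[start:end]` recovers the surface form of the token in this sketch.
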

save(dir_path)
Parameters:

dir_path (str)

Return type:

None

classmethod load(dir_path, model_variant='', **kwargs)
Parameters:
  • dir_path (str)

  • model_variant (Optional[str])

Return type:

TokenizerWrapperBPE

classmethod create_new()
get_size()
Return type:

int

token_to_id(token)
Parameters:

token (str)

Return type:

Union[int, list[int]]

get_pad_id()
Return type:

Union[int, list[int]]
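
One plausible use of the pad id is to bring a batch of `input_ids` to a common length. The sketch below is an assumption about typical usage rather than MedCAT's own batching code: `pad_batch` is a hypothetical helper, and it only handles the case where the pad id is a single `int`.

```python
def pad_batch(batch: list[list[int]], pad_id: int) -> list[list[int]]:
    """Right-pad every id sequence to the length of the longest one."""
    max_len = max((len(ids) for ids in batch), default=0)
    return [ids + [pad_id] * (max_len - len(ids)) for ids in batch]


# pad_id=0 is an illustrative value; in practice it would come from get_pad_id().
padded = pad_batch([[4, 2, 3], [2]], pad_id=0)
```
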

hf_tokenizers = None
ensure_tokenizer()
Return type:

tokenizers.Tokenizer

__slots__ = ()