medcat.preprocessors.cleaners

Exceptions

UnknownTokenVersion

Raised when an unknown token version is requested.

Classes

MutableDocument

The mutable parts of the document.

BaseTokenizer

The base tokenizer protocol.

NameDescriptor

LGeneral

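Protocol for the general config (provides separator).

LPreprocessing

Protocol for the preprocessing config (provides min_len_normalize and do_not_normalize).

LCDBMaker

Protocol for the CDB maker config (provides name_versions and min_letters_required).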

Functions

_get_tokens(config, sc_name, version)

_update_dict(configs, raw_name, names, tokens, is_upper)

prepare_name(raw_name, nlp, names, configs)

Generates different forms of a name. Will edit the provided names dictionary and add information generated from the name.

Module Contents

class medcat.preprocessors.cleaners.MutableDocument

Bases: Protocol

The mutable parts of the document.

Represents parts of the document that can / should be changed by the various components.

property base: BaseDocument

The base document.

Return type:

BaseDocument

property linked_ents: list[MutableEntity]

The linked entities associated with the document.

This should be set by the linker.

Return type:

list[MutableEntity]

property ner_ents: list[MutableEntity]

All entities recognised by NER.

This should be set by the NER component.

Return type:

list[MutableEntity]

__iter__()

Return type:

Iterator[MutableToken]

__getitem__(index: int) -> MutableToken
__getitem__(index: slice) -> MutableEntity

__len__()

Return type:

int

get_tokens(start_index, end_index)

Get the tokens that span the specified character indices.

Parameters:
  • start_index (int) – The starting character index.

  • end_index (int) – The ending character index.

Returns:

list[MutableToken] – The list of tokens.

Return type:

list[MutableToken]
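
As a rough illustration of selecting tokens by character span, the sketch below uses a hypothetical ToyToken class and a whitespace tokenizer. Both are stand-ins, not MedCAT classes. It assumes a token qualifies when it overlaps the half-open range [start_index, end_index); the boundary handling of an actual tokenizer may differ.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class ToyToken:
    """Hypothetical stand-in for a token that knows its character offsets."""
    text: str
    char_start: int  # inclusive
    char_end: int    # exclusive


def toy_tokenize(text: str) -> list[ToyToken]:
    # Split on whitespace and keep the character offsets of each piece.
    return [ToyToken(m.group(), m.start(), m.end())
            for m in re.finditer(r"\S+", text)]


def tokens_in_span(tokens: list[ToyToken], start_index: int,
                   end_index: int) -> list[ToyToken]:
    # Assumption: keep every token that overlaps [start_index, end_index).
    return [t for t in tokens
            if t.char_start < end_index and t.char_end > start_index]


selected = tokens_in_span(toy_tokenize("chest pain today"), 6, 16)
selected_texts = [t.text for t in selected]
```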

set_addon_data(path, val)

Used to add arbitrary data to the document.

This is generally used by addons to keep track of their data.

NB! The path used needs to be registered using the register_addon_path class method.

Parameters:
  • path (str) – The data ID / path.

  • val (Any) – The value to be added.

Return type:

None

has_addon_data(path)

Checks whether the addon data for a specific path has been set.

Parameters:

path (str) – The path to check.

Returns:

bool – Whether the addon data has been set.

Return type:

bool

get_addon_data(path)

Get data added to the document.

See set_addon_data for details.

Parameters:

path (str) – The data ID / path.

Returns:

Any – The stored value.

Return type:

Any

get_available_addon_paths()

Gets the available addon data paths for this document.

This will only include paths that have values set.

Returns:

list[str] – List of available addon data paths.

Return type:

list[str]

classmethod register_addon_path(path, def_val=None, force=True)

Register a custom/arbitrary data path.

This can be used to store arbitrary data along with the document for use in an addon (e.g. MetaCAT).

Note: When using this, namespace each path to the component that owns it in order to avoid conflicts.

Parameters:
  • path (str) – The path to be used. Should be prefixed by the component name (e.g. meta_cat_id for an ID tied to the meta_cat addon).

  • def_val (Any) – Default value. Defaults to None.

  • force (bool) – Whether to forcefully add the value. Defaults to True.

Return type:

None
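
The register, set, and get cycle described by the addon data methods can be sketched with a small in-memory class. ToyDocument below is hypothetical and is not a MedCAT implementation. The error handling, the behaviour of force, and the fallback to the default value are all assumptions made for illustration.

```python
from __future__ import annotations

from typing import Any


class ToyDocument:
    """Hypothetical in-memory document; NOT a MedCAT class."""

    # Registered paths and their default values, shared by all instances.
    _addon_defaults: dict[str, Any] = {}

    def __init__(self, text: str) -> None:
        self.text = text
        self._addon_data: dict[str, Any] = {}

    @classmethod
    def register_addon_path(cls, path: str, def_val: Any = None,
                            force: bool = True) -> None:
        # Assumption: without force, re-registering an existing path fails.
        if path in cls._addon_defaults and not force:
            raise ValueError(f"Addon path already registered: {path}")
        cls._addon_defaults[path] = def_val

    def set_addon_data(self, path: str, val: Any) -> None:
        # The path must have been registered on the class beforehand.
        if path not in self._addon_defaults:
            raise KeyError(f"Unregistered addon path: {path}")
        self._addon_data[path] = val

    def has_addon_data(self, path: str) -> bool:
        return path in self._addon_data

    def get_addon_data(self, path: str) -> Any:
        # Assumption: fall back to the registered default if nothing was set.
        if path not in self._addon_defaults:
            raise KeyError(f"Unregistered addon path: {path}")
        return self._addon_data.get(path, self._addon_defaults[path])

    def get_available_addon_paths(self) -> list[str]:
        # Only paths that actually have a value set on this document.
        return list(self._addon_data)


# Namespace the path to the owning component, as recommended above.
ToyDocument.register_addon_path("meta_cat_id", def_val=-1)
doc = ToyDocument("chest pain")
doc.set_addon_data("meta_cat_id", 7)
```

Registering on the class while storing values per instance mirrors the split between the class method and the instance methods listed above.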

__slots__ = ()
_is_protocol = True
_is_runtime_protocol = False
classmethod __init_subclass__(*args, **kwargs)
classmethod __class_getitem__(params)

class medcat.preprocessors.cleaners.BaseTokenizer

Bases: Protocol

The base tokenizer protocol.

create_entity(doc, token_start_index, token_end_index, label)

Create an entity from a document.

Parameters:
  • doc (MutableDocument) – The document to use.

  • token_start_index (int) – The token start index.

  • token_end_index (int) – The token end index.

  • label (str) – The label.

Returns:

MutableEntity – The resulting entity.

Return type:

medcat.tokenizing.tokens.MutableEntity

entity_from_tokens(tokens)

Get an entity from the list of tokens.

Parameters:

tokens (list[MutableToken]) – List of tokens.

Returns:

MutableEntity – The resulting entity.

Return type:

medcat.tokenizing.tokens.MutableEntity

__call__(text)

Parameters:

text (str)

Return type:

medcat.tokenizing.tokens.MutableDocument

classmethod create_new_tokenizer(config)

Parameters:

config (medcat.config.Config)

Return type:

typing_extensions.Self

get_doc_class()

Get the document implementation class used by the tokenizer.

This can be used, for example, to register addon paths.

Returns:

Type[MutableDocument] – The document class.

Return type:

Type[medcat.tokenizing.tokens.MutableDocument]

get_entity_class()

Get the entity implementation class used by the tokenizer.

Returns:

Type[MutableEntity] – The entity class.

Return type:

Type[medcat.tokenizing.tokens.MutableEntity]

__slots__ = ()
_is_protocol = True
_is_runtime_protocol = False
classmethod __init_subclass__(*args, **kwargs)
classmethod __class_getitem__(params)

class medcat.preprocessors.cleaners.NameDescriptor

tokens: list[str]
snames: set[str]
raw_name: str
is_upper: bool

class medcat.preprocessors.cleaners.LGeneral

Bases: Protocol

Base class for protocol classes.

Protocol classes are defined as:

class Proto(Protocol):
    def meth(self) -> int:
        ...

Such classes are primarily used with static type checkers that recognize structural subtyping (static duck-typing), for example:

class C:
    def meth(self) -> int:
        return 0

def func(x: Proto) -> int:
    return x.meth()

func(C())  # Passes static type check

See PEP 544 for details. Protocol classes decorated with @typing.runtime_checkable act as simple-minded runtime protocols that check only the presence of given attributes, ignoring their type signatures. Protocol classes can also be generic, in which case they are defined as:

class GenProto(Protocol[T]):
    def meth(self) -> T:
        ...

separator: str
__slots__ = ()
_is_protocol = True
_is_runtime_protocol = False
classmethod __init_subclass__(*args, **kwargs)
classmethod __class_getitem__(params)
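
To show how a structural protocol of this kind is satisfied, the sketch below defines a local copy of the one attribute listed for LGeneral. MyGeneralConfig and join_tokens are hypothetical names, and the "~" separator value is an assumption. The config class never inherits from the protocol; it only needs a separator attribute of type str.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


class LGeneralSketch(Protocol):
    """Local copy of the single attribute listed for LGeneral."""
    separator: str


@dataclass
class MyGeneralConfig:
    """Hypothetical config object; it does not subclass the protocol."""
    separator: str = "~"  # assumed value, for illustration only


def join_tokens(config: LGeneralSketch, tokens: list[str]) -> str:
    # Only `separator` is read, so any object providing it is accepted.
    return config.separator.join(tokens)


joined = join_tokens(MyGeneralConfig(), ["chest", "pain"])
```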

class medcat.preprocessors.cleaners.LPreprocessing

Bases: Protocol

Protocol for the preprocessing config values read by this module.

min_len_normalize: int
do_not_normalize: set[str]
__slots__ = ()
_is_protocol = True
_is_runtime_protocol = False
classmethod __init_subclass__(*args, **kwargs)
classmethod __class_getitem__(params)

class medcat.preprocessors.cleaners.LCDBMaker

Bases: Protocol

Protocol for the CDB maker config values read by this module.

name_versions: list[str]
min_letters_required: int
__slots__ = ()
_is_protocol = True
_is_runtime_protocol = False
classmethod __init_subclass__(*args, **kwargs)
classmethod __class_getitem__(params)

medcat.preprocessors.cleaners._get_tokens(config, sc_name, version)

Return type:

list[str]

medcat.preprocessors.cleaners._update_dict(configs, raw_name, names, tokens, is_upper)

Return type:

None

medcat.preprocessors.cleaners.prepare_name(raw_name, nlp, names, configs)

Generates different forms of a name. Will edit the provided names dictionary and add information generated from the name.

Parameters:
  • raw_name (str) – The raw name to generate forms for.

  • nlp (BaseTokenizer) – The tokenizer.

  • names (dict[str, NameDescriptor]) – Dictionary of existing names for this concept in this row of a CSV. The newly generated name versions and other required information will be added here.

  • configs (tuple[LGeneral, LPreprocessing, LCDBMaker]) – Applicable configs for medcat.

Returns:

names (dict) – The updated dictionary of prepared names.

Return type:

dict[str, NameDescriptor]
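
The "edit the dictionary in place and return it" contract can be sketched without MedCAT installed. Everything below is a hypothetical simplification: NameDescriptorSketch copies only the four fields listed for NameDescriptor, a whitespace split with lowercasing stands in for the tokenizer and name versions, and the rule that snames holds every leading sub-name is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class NameDescriptorSketch:
    """Local copy of the fields listed for NameDescriptor."""
    tokens: list[str]
    snames: set[str]
    raw_name: str
    is_upper: bool


def prepare_name_sketch(raw_name: str,
                        names: dict[str, NameDescriptorSketch],
                        separator: str = "~",
                        ) -> dict[str, NameDescriptorSketch]:
    # Assumption: lowercase + whitespace split replaces real tokenization.
    tokens = raw_name.lower().split()
    if not tokens:
        return names
    name = separator.join(tokens)
    if name not in names:
        # Assumption: snames are all leading sub-names of the token list.
        snames = {separator.join(tokens[:i + 1]) for i in range(len(tokens))}
        names[name] = NameDescriptorSketch(
            tokens=tokens,
            snames=snames,
            raw_name=raw_name,
            is_upper=raw_name.isupper(),
        )
    # The same dictionary object that was passed in is returned.
    return names


names: dict[str, NameDescriptorSketch] = {}
result = prepare_name_sketch("Chest Pain", names)
```

Because the input dictionary is mutated, repeated calls for different raw names of the same concept accumulate into one mapping.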

exception medcat.preprocessors.cleaners.UnknownTokenVersion(version)

Bases: ValueError

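Raised when an unknown token version is requested. As a ValueError subclass, it signals an argument of the correct type but with an inappropriate value.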

Parameters:

version (str)

__init__(version)

Initialize the exception for the given version.

Parameters:

version (str)

Return type:

None
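
A minimal sketch of how a ValueError subclass like this is typically raised and caught is shown below. UnknownTokenVersionSketch and get_tokens_sketch are local stand-ins, and the version names "LOWER" and "RAW" as well as the message format are assumptions, not values taken from MedCAT.

```python
from __future__ import annotations


class UnknownTokenVersionSketch(ValueError):
    """Local stand-in; the real message format is not documented here."""

    def __init__(self, version: str) -> None:
        super().__init__(f"Unknown token version: {version!r}")
        self.version = version


def get_tokens_sketch(words: list[str], version: str) -> list[str]:
    # Hypothetical version names, used only for this illustration.
    if version == "LOWER":
        return [w.lower() for w in words]
    if version == "RAW":
        return list(words)
    raise UnknownTokenVersionSketch(version)


caught = None
try:
    get_tokens_sketch(["Chest"], "TITLE")
except ValueError as err:  # the subclass is caught as a ValueError too
    caught = err
```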

The remaining attributes and methods (for example args and with_traceback()) are inherited unchanged from ValueError.