medcat.tokenizing.spacy_impl.tokens

Attributes

logger

Exceptions

UnregisteredDataPathException

Inappropriate argument value (of correct type).

Classes

BaseToken

Base token protocol.

MutableToken

The mutable part of a token.

BaseEntity

Base entity protocol.

MutableEntity

The mutable part of an entity.

BaseDocument

The base document protocol.

Token

Entity

Document

Module Contents

class medcat.tokenizing.spacy_impl.tokens.BaseToken

Bases: Protocol

Base token protocol.

This represents the static (unchangeable) parts of a token.

property text: str

The text represented by this token.

Return type:

str

property lower: str

The lower case text representation.

Return type:

str

property text_versions: list[str]

The different versions of the text (e.g. normalised and lower case).

Return type:

list[str]

property is_upper: bool

Whether the text is upper case.

Return type:

bool

property is_stop: bool

Whether the token represents a stop token.

Return type:

bool

property char_index: int

The character index of the start of this token.

Return type:

int

property index: int

The index (in terms of tokens) of this token in the document.

Return type:

int

property text_with_ws: str

The text with trailing whitespace (where applicable).

Return type:

str

property is_digit: bool

Whether the token represents a digit.

Return type:

bool

__slots__ = ()
_is_protocol = True
_is_runtime_protocol = False
classmethod __init_subclass__(*args, **kwargs)
classmethod __class_getitem__(params)
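
Because BaseToken is a Protocol, an object only needs to expose the listed properties to satisfy it; no inheritance is required. The following is a minimal sketch of that idea, not MedCAT code: TokenLike is a local, trimmed-down stand-in that mirrors only a few of the properties above, and SimpleToken and describe are hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from typing import Protocol


class TokenLike(Protocol):
    """Local stand-in mirroring a subset of the BaseToken properties."""

    @property
    def text(self) -> str: ...

    @property
    def lower(self) -> str: ...

    @property
    def is_upper(self) -> bool: ...

    @property
    def char_index(self) -> int: ...

    @property
    def index(self) -> int: ...


@dataclass(frozen=True)
class SimpleToken:
    """Hypothetical immutable token; it does not inherit from TokenLike."""

    text: str
    char_index: int
    index: int

    @property
    def lower(self) -> str:
        return self.text.lower()

    @property
    def is_upper(self) -> bool:
        return self.text.isupper()


def describe(token: TokenLike) -> str:
    """Accept anything that structurally matches TokenLike."""
    return f"{token.index}:{token.char_index}:{token.lower}"


tok = SimpleToken(text="MRI", char_index=8, index=2)
summary = describe(tok)
```

A static type checker would accept SimpleToken wherever TokenLike is expected, since the frozen dataclass fields and the two properties together cover every member of the stand-in protocol.
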
class medcat.tokenizing.spacy_impl.tokens.MutableToken

Bases: Protocol

The mutable part of a token.

This protocol describes all the parts of a token that could be expected to change.

property base: BaseToken

The base portion of the token.

Return type:

BaseToken

property is_punctuation: bool

Whether the token represents punctuation.

Return type:

bool

property to_skip: bool

Whether the token should be skipped.

Return type:

bool

property lemma: str

The lemmatised version of the text.

Return type:

str

property tag: str | None

Optional tag (e.g. for normalization).

Return type:

Optional[str]

property norm: str

The normalised text.

Return type:

str

__slots__ = ()
_is_protocol = True
_is_runtime_protocol = False
classmethod __init_subclass__(*args, **kwargs)
classmethod __class_getitem__(params)
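
The split between a static base and a mutable part can be pictured as a frozen record wrapped by an editable one. This is only a sketch under assumptions: StaticPart, EditableToken and preprocess are hypothetical names, the field set is a subset of the properties above, and the punctuation rule is a toy heuristic rather than what any tokenizer actually does.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class StaticPart:
    """Hypothetical stand-in for the static (BaseToken-like) data."""

    text: str
    index: int


@dataclass
class EditableToken:
    """Hypothetical stand-in for a MutableToken-like object."""

    base: StaticPart
    norm: str = ""
    lemma: str = ""
    tag: Optional[str] = None
    is_punctuation: bool = False
    to_skip: bool = False


def preprocess(tokens: Iterable[EditableToken]) -> None:
    """Toy pipeline step: normalise text and flag punctuation for skipping."""
    for token in tokens:
        token.norm = token.base.text.lower()
        # Toy rule (an assumption): no alphanumeric character means punctuation.
        token.is_punctuation = not any(ch.isalnum() for ch in token.base.text)
        token.to_skip = token.is_punctuation


tokens = [
    EditableToken(StaticPart(text, i))
    for i, text in enumerate(["Fever", ",", "cough"])
]
preprocess(tokens)
kept = [t.norm for t in tokens if not t.to_skip]
```

The static part stays untouched throughout, while later components are free to overwrite norm, to_skip and the other mutable fields.
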
class medcat.tokenizing.spacy_impl.tokens.BaseEntity

Bases: Protocol

Base entity protocol.

This describes the static (unchangeable) parts of an entity or sequence of tokens.

property start_index: int

The index of the first token in the entity.

Return type:

int

property end_index: int

The index of the last token in the entity.

Return type:

int

property start_char_index: int

The character index of the first token.

Return type:

int

property end_char_index: int

The character index of the last token.

Return type:

int

property label: int

The label of the entity (NOTE: seems unused).

Return type:

int

property text: str

The text of the entire entity.

Return type:

str

__iter__()
Return type:

Iterator[BaseToken]

__len__()
Return type:

int

__slots__ = ()
_is_protocol = True
_is_runtime_protocol = False
classmethod __init_subclass__(*args, **kwargs)
classmethod __class_getitem__(params)
class medcat.tokenizing.spacy_impl.tokens.MutableEntity

Bases: Protocol

The mutable part of an entity.

This represents the changeable parts of an entity. That is, the parts that the various components are expected to change.

property base: BaseEntity

The base / static entity part.

Return type:

BaseEntity

property detected_name: str

The detected name (if any) for this entity.

This should be set by the NER component.

Return type:

str

set_addon_data(path, val)

Used to add arbitrary data to the entity.

This is generally used by addons to keep track of their data.

NB! The path used needs to be registered using the register_addon_path class method.

Parameters:
  • path (str) – The data ID / path.

  • val (Any) – The value to be added.

Return type:

None

has_addon_data(path)

Checks whether the addon data for a specific path has been set.

Parameters:

path (str) – The path to check.

Returns:

bool – Whether the addon data has been set.

Return type:

bool

get_addon_data(path)

Get data added to the entity.

See set_addon_data for details.

Parameters:

path (str) – The data ID / path.

Returns:

Any – The stored value.

Return type:

Any

get_available_addon_paths()

Gets the available addon data paths for this entity.

This will only include paths that have values set.

Returns:

list[str] – List of available addon data paths.

Return type:

list[str]

The candidates for the detected name (if any) for this entity.

This should be set by the NER component.

Return type:

list[str]

property context_similarity: float

The context similarity of the linked entity.

This should be set by the linker component.

Return type:

float

property confidence: float

The confidence for the linked entity.

NOTE: This seems to be unused!

Return type:

float

property cui: str

The CUI of the linked entity.

This should be set by the linker component.

Return type:

str

property id: int

The ID of the entity within the document.

This counts all the entities recognised, not just ones that were successfully linked.

This should be set by the NER.

Return type:

int

classmethod register_addon_path(path, def_val=None, force=True)

Register a custom/arbitrary data path.

This can be used to store arbitrary data along with the entity for use in an addon (e.g. MetaCAT).

PS: If using this, it is important to use paths namespaced to the component you’re using in order to avoid conflicts.

Parameters:
  • path (str) – The path to be used. Should be prefixed by the component name (e.g. meta_cat_id for an ID tied to the meta_cat addon).

  • def_val (Any) – Default value. Defaults to None.

  • force (bool) – Whether to forcefully add the value. Defaults to True.

Return type:

None
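
The register-then-set contract described for the addon data methods can be sketched as follows. This is an illustrative stand-in and not the library's implementation: AddonDataHolder and UnregisteredPathError are hypothetical names, and two behaviours are assumptions that the text above does not confirm, namely that force=True overwrites an existing registration and that reading a registered but unset path yields its default value.

```python
from typing import Any


class UnregisteredPathError(ValueError):
    """Hypothetical analogue of UnregisteredDataPathException."""

    def __init__(self, cls: type, path: str) -> None:
        super().__init__(f"Path {path!r} is not registered for {cls.__name__}")
        self.cls = cls
        self.path = path


class AddonDataHolder:
    """Hypothetical object following the addon-data contract."""

    # Class-level registry shared by all instances: path -> default value.
    _registered: dict[str, Any] = {}

    def __init__(self) -> None:
        self._data: dict[str, Any] = {}

    @classmethod
    def register_addon_path(cls, path: str, def_val: Any = None,
                            force: bool = True) -> None:
        # ASSUMPTION: force=False keeps an existing registration untouched.
        if path in cls._registered and not force:
            return
        cls._registered[path] = def_val

    def set_addon_data(self, path: str, val: Any) -> None:
        if path not in self._registered:
            raise UnregisteredPathError(type(self), path)
        self._data[path] = val

    def has_addon_data(self, path: str) -> bool:
        return path in self._data

    def get_addon_data(self, path: str) -> Any:
        if path not in self._registered:
            raise UnregisteredPathError(type(self), path)
        # ASSUMPTION: fall back to the registered default when nothing is set.
        return self._data.get(path, self._registered[path])

    def get_available_addon_paths(self) -> list[str]:
        # Only paths that actually have a value set on this instance.
        return [p for p in self._registered if p in self._data]


# Namespaced path, as recommended above.
AddonDataHolder.register_addon_path("meta_cat_id", def_val=-1)
ent = AddonDataHolder()
before = ent.get_addon_data("meta_cat_id")
ent.set_addon_data("meta_cat_id", 7)

try:
    ent.set_addon_data("unknown_path", 1)
    failed_path = None
except UnregisteredPathError as err:
    failed_path = err.path
```

Keeping the registry on the class and the values on the instance is one way to let every entity share the same set of valid paths while holding its own data.
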

__iter__()
Return type:

Iterator[MutableToken]

__len__()
Return type:

int

__slots__ = ()
_is_protocol = True
_is_runtime_protocol = False
classmethod __init_subclass__(*args, **kwargs)
classmethod __class_getitem__(params)
class medcat.tokenizing.spacy_impl.tokens.BaseDocument

Bases: Protocol

The base document protocol.

Represents the unchangeable parts of the whole document.

property text: str

The document raw text.

Return type:

str

__getitem__(index: int) → BaseToken
__getitem__(index: slice) → BaseEntity
__iter__()
Return type:

Iterator[BaseToken]

isupper()

Whether the entire document is upper case.

Return type:

bool

__slots__ = ()
_is_protocol = True
_is_runtime_protocol = False
classmethod __init_subclass__(*args, **kwargs)
classmethod __class_getitem__(params)
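
The two __getitem__ signatures listed for BaseDocument indicate that indexing with an int yields a single token, while indexing with a slice yields an entity-like span. A rough sketch of that overload pattern follows, under simplifying assumptions: SketchDocument is a hypothetical class, tokens are plain strings produced by whitespace splitting, and a "span" is just a list of those strings.

```python
from typing import Iterator, Union, overload


class SketchDocument:
    """Hypothetical whitespace-tokenised document; not MedCAT's Document."""

    def __init__(self, text: str) -> None:
        self.text = text
        self._tokens = text.split()

    @overload
    def __getitem__(self, index: int) -> str: ...

    @overload
    def __getitem__(self, index: slice) -> list[str]: ...

    def __getitem__(self, index: Union[int, slice]) -> Union[str, list[str]]:
        # List indexing already distinguishes an int from a slice.
        return self._tokens[index]

    def __iter__(self) -> Iterator[str]:
        return iter(self._tokens)

    def isupper(self) -> bool:
        return self.text.isupper()


doc = SketchDocument("CHEST PAIN NOTED")
first_token = doc[0]
span = doc[0:2]
```

The overload declarations only inform type checkers; at runtime the single final definition handles both cases.
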
exception medcat.tokenizing.spacy_impl.tokens.UnregisteredDataPathException(cls, path)

Bases: ValueError

Inappropriate argument value (of correct type).

Parameters:
  • cls (Type)

  • path (str)

__init__(cls, path)

Initialize self. See help(type(self)) for accurate signature.

Parameters:
  • cls (Type)

  • path (str)

cls
path
class __cause__

exception cause

class __context__

exception context

__delattr__()

Implement delattr(self, name).

__dir__()

Default dir() implementation.

__eq__()

Return self==value.

__format__()

Default object formatter.

__ge__()

Return self>=value.

__getattribute__()

Return getattr(self, name).

__gt__()

Return self>value.

__hash__()

Return hash(self).

__le__()

Return self<=value.

__lt__()

Return self<value.

__ne__()

Return self!=value.

__new__()

Create and return a new object. See help(type) for accurate signature.

__reduce__()
__reduce_ex__()

Helper for pickle.

__repr__()

Return repr(self).

__setattr__()

Implement setattr(self, name, value).

__setstate__()
__sizeof__()

Size of object in memory, in bytes.

__str__()

Return str(self).

__subclasshook__()

Abstract classes can override this to customize issubclass().

This is invoked early on by abc.ABCMeta.__subclasscheck__(). It should return True, False or NotImplemented. If it returns NotImplemented, the normal algorithm is used. Otherwise, it overrides the normal algorithm (and the outcome is cached).

class __suppress_context__
class __traceback__
class args
with_traceback()

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

medcat.tokenizing.spacy_impl.tokens.logger
class medcat.tokenizing.spacy_impl.tokens.Token(delegate)
Parameters:

delegate (spacy.tokens.Token)

__init__(delegate)
Parameters:

delegate (spacy.tokens.Token)

Return type:

None

_delegate
property is_punctuation: bool
Return type:

bool

property to_skip: bool
Return type:

bool

property norm: str
Return type:

str

property base: medcat.tokenizing.tokens.BaseToken
Return type:

medcat.tokenizing.tokens.BaseToken

property text: str
Return type:

str

property text_versions: list[str]
Return type:

list[str]

property lower: str
Return type:

str

property is_stop: bool
Return type:

bool

property is_digit: bool
Return type:

bool

property is_upper: bool
Return type:

bool

property tag: str | None
Return type:

Optional[str]

property lemma: str
Return type:

str

property text_with_ws: str
Return type:

str

property char_index: int
Return type:

int

property index: int
Return type:

int

__str__()
__repr__()
__hash__()
Return type:

int

__eq__(value)
Return type:

bool
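
Token takes a delegate and exposes the protocol properties on top of it. The delegation pattern can be sketched without spaCy as shown below. Everything here is an assumption made for illustration: FakeSpacyToken stands in for spacy.tokens.Token and copies only four of its attribute names (text, lower_, idx, i), DelegatingToken is a hypothetical wrapper, and the choice of which attribute backs which property is a guess rather than a description of the class above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeSpacyToken:
    """Stand-in for spacy.tokens.Token so the sketch runs without spaCy."""

    text: str
    lower_: str
    idx: int  # character offset of the token in the document
    i: int    # position of the token in the document


class DelegatingToken:
    """Hypothetical wrapper that forwards property reads to a delegate."""

    def __init__(self, delegate: FakeSpacyToken) -> None:
        self._delegate = delegate

    @property
    def base(self) -> "DelegatingToken":
        # ASSUMPTION: one object serves as both the static and mutable view.
        return self

    @property
    def text(self) -> str:
        return self._delegate.text

    @property
    def lower(self) -> str:
        return self._delegate.lower_

    @property
    def char_index(self) -> int:
        return self._delegate.idx

    @property
    def index(self) -> int:
        return self._delegate.i

    def __hash__(self) -> int:
        return hash(self._delegate)

    def __eq__(self, value: object) -> bool:
        return (isinstance(value, DelegatingToken)
                and self._delegate == value._delegate)

    def __repr__(self) -> str:
        return f"DelegatingToken({self._delegate.text!r})"


raw = FakeSpacyToken(text="Aspirin", lower_="aspirin", idx=10, i=3)
first, second = DelegatingToken(raw), DelegatingToken(raw)
unique = {first, second}
```

Defining __hash__ and __eq__ in terms of the delegate means two wrappers around the same underlying token behave as one key in sets and dictionaries.
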

class medcat.tokenizing.spacy_impl.tokens.Entity(delegate)
Parameters:

delegate (spacy.tokens.Span)

_addon_extension_paths: set[str]
__init__(delegate)
Parameters:

delegate (spacy.tokens.Span)

Return type:

None

_delegate
context_similarity: float = 0.0
confidence: float = 0.0
cui = ''
id = -1
detected_name = ''
property base: medcat.tokenizing.tokens.BaseEntity
Return type:

medcat.tokenizing.tokens.BaseEntity

set_addon_data(path, val)
Parameters:
  • path (str)

  • val (Any)

Return type:

None

has_addon_data(path)
Parameters:

path (str)

Return type:

bool

get_addon_data(path)
Parameters:

path (str)

Return type:

Any

get_available_addon_paths()
Return type:

list[str]

classmethod register_addon_path(path, def_val=None, force=True)
Parameters:
  • path (str)

  • def_val (Any)

  • force (bool)

Return type:

None

property text: str
Return type:

str

property label: int
Return type:

int

property start_index: int
Return type:

int

property end_index: int
Return type:

int

property start_char_index: int
Return type:

int

property end_char_index: int
Return type:

int

__iter__()
Return type:

Iterator[medcat.tokenizing.tokens.MutableToken]

__len__()
Return type:

int

__str__()
__repr__()
class medcat.tokenizing.spacy_impl.tokens.Document(delegate)
Parameters:

delegate (spacy.tokens.Doc)

_addon_extension_paths: set[str]
__init__(delegate)
Parameters:

delegate (spacy.tokens.Doc)

Return type:

None

_delegate
ner_ents: list[medcat.tokenizing.tokens.MutableEntity] = []
linked_ents: list[medcat.tokenizing.tokens.MutableEntity] = []
property base: medcat.tokenizing.tokens.BaseDocument
Return type:

medcat.tokenizing.tokens.BaseDocument

property text: str
Return type:

str

__getitem__(index: int) → medcat.tokenizing.tokens.MutableToken
__getitem__(index: slice) → medcat.tokenizing.tokens.MutableEntity
__len__()
Return type:

int

get_tokens(start_index, end_index)
Parameters:
  • start_index (int)

  • end_index (int)

Return type:

list[medcat.tokenizing.tokens.MutableToken]

set_addon_data(path, val)
Parameters:
  • path (str)

  • val (Any)

Return type:

None

has_addon_data(path)
Parameters:

path (str)

Return type:

bool

get_addon_data(path)
Parameters:

path (str)

Return type:

Any

get_available_addon_paths()
Return type:

list[str]

classmethod register_addon_path(path, def_val=None, force=True)
Parameters:
  • path (str)

  • def_val (Any)

  • force (bool)

Return type:

None

__iter__()
Return type:

Iterator[medcat.tokenizing.tokens.MutableToken]

isupper()
Return type:

bool

__str__()
__repr__()