medcat.tokenizing.tokens
Exceptions
UnregisteredDataPathException | Inappropriate argument value (of correct type).
Classes
BaseToken | Base token protocol.
MutableToken | The mutable part of a token.
BaseEntity | Base entity protocol.
MutableEntity | The mutable part of an entity.
BaseDocument | The base document protocol.
MutableDocument | The mutable parts of the document.
Module Contents
- class medcat.tokenizing.tokens.BaseToken
Bases:
Protocol
Base token protocol.
This represents the static (unchangeable) parts of a token.
- property text: str
The text represented by this token.
- Return type:
str
- property lower: str
The lower case text representation.
- Return type:
str
- property text_versions: list[str]
The different versions of the text (e.g. normalised and lower case).
- Return type:
list[str]
- property is_upper: bool
Whether the text is upper case.
- Return type:
bool
- property is_stop: bool
Whether the token represents a stop token.
- Return type:
bool
- property char_index: int
The character index of the start of this token
- Return type:
int
- property index: int
The index (in terms of tokens) of this token in the document.
- Return type:
int
- property text_with_ws: str
The text with trailing whitespace (where applicable).
- Return type:
str
- property is_digit: bool
Whether the token represents a digit.
- Return type:
bool
- __slots__ = ()
- _is_protocol = True
- _is_runtime_protocol = False
- classmethod __init_subclass__(*args, **kwargs)
- classmethod __class_getitem__(params)
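To make the read-only properties above concrete, here is a minimal sketch of a class that exposes a subset of them. `SimpleToken` and `tokenize` are hypothetical names invented for illustration and are not part of medcat. The sketch assumes plain whitespace tokenisation and leaves out `is_stop` and `text_versions`.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SimpleToken:
    """Hypothetical read-only token; not part of medcat."""

    text: str
    char_index: int  # character offset of the token start in the document
    index: int  # position of the token in the token sequence
    trailing_ws: str = ""  # whitespace that followed the token, if any

    @property
    def lower(self) -> str:
        return self.text.lower()

    @property
    def text_with_ws(self) -> str:
        return self.text + self.trailing_ws

    @property
    def is_upper(self) -> bool:
        return self.text.isupper()

    @property
    def is_digit(self) -> bool:
        return self.text.isdigit()


def tokenize(text: str) -> list[SimpleToken]:
    """Split on whitespace, remembering offsets and trailing whitespace."""
    return [
        SimpleToken(m.group(1), m.start(1), i, m.group(2))
        for i, m in enumerate(re.finditer(r"(\S+)(\s*)", text))
    ]


tokens = tokenize("BP 120 over 80")
# Joining text_with_ws recovers the input (when it has no leading whitespace).
rebuilt = "".join(t.text_with_ws for t in tokens)
```

Because the class is frozen, assigning to any field raises an error, which matches the "static (unchangeable)" role described for this protocol.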
- class medcat.tokenizing.tokens.MutableToken
Bases:
Protocol
The mutable part of a token.
This protocol describes all the parts of a token that could be expected to change.
- property is_punctuation: bool
Whether the token represents punctuation.
- Return type:
bool
- property to_skip: bool
Whether the token should be skipped.
- Return type:
bool
- property lemma: str
The lemmatised version of the text.
- Return type:
str
- property tag: str | None
Optional tag (e.g. for normalization).
- Return type:
Optional[str]
- property norm: str
The normalised text.
- Return type:
str
- __slots__ = ()
- _is_protocol = True
- _is_runtime_protocol = False
- classmethod __init_subclass__(*args, **kwargs)
- classmethod __class_getitem__(params)
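As an illustration of how a pipeline component might fill in these changeable fields, the sketch below uses a hypothetical `ToyMutableToken` dataclass and a hypothetical `tag_tokens` step. Neither exists in medcat, and the punctuation and normalisation rules are simplifying assumptions.

```python
import string
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyMutableToken:
    """Hypothetical token carrying the mutable fields; not part of medcat."""

    text: str
    is_punctuation: bool = False
    to_skip: bool = False
    lemma: str = ""
    tag: Optional[str] = None
    norm: str = ""


def tag_tokens(tokens: list[ToyMutableToken]) -> None:
    """Hypothetical tagging step: mark punctuation and set a normalised form."""
    for tkn in tokens:
        tkn.is_punctuation = bool(tkn.text) and all(
            ch in string.punctuation for ch in tkn.text
        )
        # Assumption: punctuation is skipped by later components.
        tkn.to_skip = tkn.is_punctuation
        # Assumption: the normalised form is simply the lower-cased text.
        tkn.norm = tkn.text.lower()


tokens = [ToyMutableToken(t) for t in ["Fever", ",", "cough"]]
tag_tokens(tokens)
kept = [t.norm for t in tokens if not t.to_skip]
```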
- class medcat.tokenizing.tokens.BaseEntity
Bases:
Protocol
Base entity protocol.
This describes the static (unchangeable) parts of an entity or sequence of tokens.
- property start_index: int
The index of the first token in the entity.
- Return type:
int
- property end_index: int
The index of the last token in the entity.
- Return type:
int
- property start_char_index: int
The character index of the first token.
- Return type:
int
- property end_char_index: int
The character index of the last token.
- Return type:
int
- property label: int
The label of the entity (NOTE: seems unused).
- Return type:
int
- property text: str
The text of the entire entity.
- Return type:
str
- __len__()
- Return type:
int
- __slots__ = ()
- _is_protocol = True
- _is_runtime_protocol = False
- classmethod __init_subclass__(*args, **kwargs)
- classmethod __class_getitem__(params)
- class medcat.tokenizing.tokens.MutableEntity
Bases:
Protocol
The mutable part of an entity.
This represents the changeable part of an entity, that is, the parts that should be changed by the various components.
- property base: BaseEntity
The base / static entity part.
- Return type:
BaseEntity
- property detected_name: str
The detected name (if any) for this entity.
This should be set by the NER component.
- Return type:
str
- set_addon_data(path, val)
Used to add arbitrary data to the entity.
This is generally used by addons to keep track of their data.
NB! The path used needs to be registered using the register_addon_path class method.
- Parameters:
path (str) – The data ID / path.
val (Any) – The value to be added.
- Return type:
None
- has_addon_data(path)
Checks whether the addon data for a specific path has been set.
- Parameters:
path (str) – The path to check.
- Returns:
bool – Whether the addon data has been set.
- Return type:
bool
- get_addon_data(path)
Get data added to the entity.
See set_addon_data for details.
- Parameters:
path (str) – The data ID / path.
- Returns:
Any – The stored value.
- Return type:
Any
- get_available_addon_paths()
Gets the available addon data paths for this entity.
This will only include paths that have values set.
- Returns:
list[str] – List of available addon data paths.
- Return type:
list[str]
- property link_candidates: list[str]
The candidates for the detected name (if any) for this entity.
This should be set by the NER component.
- Return type:
list[str]
- property context_similarity: float
The context similarity of the linked entity.
This should be set by the linker component.
- Return type:
float
- property confidence: float
The confidence for the linked entity.
NOTE: This seems to be unused!
- Return type:
float
- property cui: str
The CUI of the linked entity.
This should be set by the linker component.
- Return type:
str
- property id: int
The ID of the entity within the document.
This counts all the entities recognised, not just ones that were successfully linked.
This should be set by the NER.
- Return type:
int
- classmethod register_addon_path(path, def_val=None, force=True)
Register a custom/arbitrary data path.
This can be used to store arbitrary data along with the entity for use in an addon (e.g. MetaCAT).
PS: If using this, it is important to use paths namespaced to the component you’re using in order to avoid conflicts.
- Parameters:
path (str) – The path to be used. Should be prefixed by the component name (e.g. meta_cat_id for an ID tied to the meta_cat addon).
def_val (Any) – Default value. Defaults to None.
force (bool) – Whether to forcefully add the value. Defaults to True.
- Return type:
None
- __iter__()
- Return type:
Iterator[MutableToken]
- __len__()
- Return type:
int
- __slots__ = ()
- _is_protocol = True
- _is_runtime_protocol = False
- classmethod __init_subclass__(*args, **kwargs)
- classmethod __class_getitem__(params)
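The register-then-set workflow described for the addon data methods can be sketched with a toy class. `ToyEntity` and `UnregisteredPathError` are hypothetical stand-ins and not medcat's implementation. The sketch assumes that reading a registered but unset path yields the registered default, and it leaves out the `force` parameter because its exact semantics are not described above.

```python
from typing import Any


class UnregisteredPathError(ValueError):
    """Hypothetical stand-in for UnregisteredDataPathException."""


class ToyEntity:
    """Hypothetical entity with registered addon data paths."""

    # Class-level registry shared by all instances: path -> default value.
    _registered: dict[str, Any] = {}

    def __init__(self) -> None:
        self._data: dict[str, Any] = {}

    @classmethod
    def register_addon_path(cls, path: str, def_val: Any = None) -> None:
        cls._registered[path] = def_val

    def _check_registered(self, path: str) -> None:
        if path not in self._registered:
            raise UnregisteredPathError(f"{path!r} was never registered")

    def set_addon_data(self, path: str, val: Any) -> None:
        self._check_registered(path)
        self._data[path] = val

    def has_addon_data(self, path: str) -> bool:
        return path in self._data

    def get_addon_data(self, path: str) -> Any:
        self._check_registered(path)
        # Assumption: an unset path falls back to its registered default.
        return self._data.get(path, self._registered[path])

    def get_available_addon_paths(self) -> list[str]:
        # Only paths that actually have a value set on this instance.
        return list(self._data)


# The path is namespaced with the addon name, as recommended above.
ToyEntity.register_addon_path("meta_cat_id", def_val=-1)
ent = ToyEntity()
default_value = ent.get_addon_data("meta_cat_id")
ent.set_addon_data("meta_cat_id", 7)

try:
    ent.set_addon_data("never_registered", 1)
    raised = False
except UnregisteredPathError:
    raised = True
```

Keeping the registry on the class rather than the instance mirrors the fact that `register_addon_path` is documented as a class method, so a path is registered once and is then valid for every entity.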
- class medcat.tokenizing.tokens.BaseDocument
Bases:
Protocol
The base document protocol.
Represents the unchangeable parts of the whole document.
- property text: str
The document raw text.
- Return type:
str
- __getitem__(index: int) → BaseToken
- __getitem__(index: slice) → BaseEntity
- isupper()
Whether the entire document is upper case.
- Return type:
bool
- __slots__ = ()
- _is_protocol = True
- _is_runtime_protocol = False
- classmethod __init_subclass__(*args, **kwargs)
- classmethod __class_getitem__(params)
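The two `__getitem__` signatures indicate that an integer index yields a single token while a slice yields an entity (a sequence of tokens). A minimal sketch of that pattern using `typing.overload` follows. `ToyToken`, `ToySpan` and `ToyDocument` are hypothetical classes, and whitespace splitting is an assumption made only to keep the example short.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Union, overload


@dataclass(frozen=True)
class ToyToken:
    text: str
    index: int


@dataclass(frozen=True)
class ToySpan:
    """Hypothetical entity: a contiguous run of tokens."""

    tokens: tuple[ToyToken, ...]

    @property
    def text(self) -> str:
        return " ".join(t.text for t in self.tokens)

    def __len__(self) -> int:
        return len(self.tokens)


class ToyDocument:
    """Hypothetical document; tokens come from a plain whitespace split."""

    def __init__(self, text: str) -> None:
        self.text = text
        self._tokens = [ToyToken(w, i) for i, w in enumerate(text.split())]

    @overload
    def __getitem__(self, index: int) -> ToyToken: ...
    @overload
    def __getitem__(self, index: slice) -> ToySpan: ...

    def __getitem__(self, index: Union[int, slice]) -> Union[ToyToken, ToySpan]:
        if isinstance(index, slice):
            return ToySpan(tuple(self._tokens[index]))
        return self._tokens[index]

    def isupper(self) -> bool:
        return self.text.isupper()


doc = ToyDocument("acute kidney injury")
single = doc[1]  # one token
span = doc[0:2]  # a span covering the first two tokens
```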
- class medcat.tokenizing.tokens.MutableDocument
Bases:
Protocol
The mutable parts of the document.
Represents parts of the document that can / should be changed by the various components.
- property base: BaseDocument
The base document.
- Return type:
BaseDocument
- property linked_ents: list[MutableEntity]
The linked entities associated with the document.
This should be set by the linker.
- Return type:
list[MutableEntity]
- property ner_ents: list[MutableEntity]
All entities recognised by NER.
This should be set by the NER component.
- Return type:
list[MutableEntity]
- __iter__()
- Return type:
Iterator[MutableToken]
- __getitem__(index: int) → MutableToken
- __getitem__(index: slice) → MutableEntity
- __len__()
- Return type:
int
- get_tokens(start_index, end_index)
Get the tokens that span the specified character indices.
- Parameters:
start_index (int) – The starting character index.
end_index (int) – The ending character index.
- Returns:
list[MutableToken] – The list of tokens.
- Return type:
list[MutableToken]
- set_addon_data(path, val)
Used to add arbitrary data to the document.
This is generally used by addons to keep track of their data.
NB! The path used needs to be registered using the register_addon_path class method.
- Parameters:
path (str) – The data ID / path.
val (Any) – The value to be added.
- Return type:
None
- has_addon_data(path)
Checks whether the addon data for a specific path has been set.
- Parameters:
path (str) – The path to check.
- Returns:
bool – Whether the addon data has been set.
- Return type:
bool
- get_addon_data(path)
Get data added to the document.
See set_addon_data for details.
- Parameters:
path (str) – The data ID / path.
- Returns:
Any – The stored value.
- Return type:
Any
- get_available_addon_paths()
Gets the available addon data paths for this document.
This will only include paths that have values set.
- Returns:
list[str] – List of available addon data paths.
- Return type:
list[str]
- classmethod register_addon_path(path, def_val=None, force=True)
Register a custom/arbitrary data path.
This can be used to store arbitrary data along with the document for use in an addon (e.g. MetaCAT).
PS: If using this, it is important to use paths namespaced to the component you’re using in order to avoid conflicts.
- Parameters:
path (str) – The path to be used. Should be prefixed by the component name (e.g. meta_cat_id for an ID tied to the meta_cat addon).
def_val (Any) – Default value. Defaults to None.
force (bool) – Whether to forcefully add the value. Defaults to True.
- Return type:
None
- __slots__ = ()
- _is_protocol = True
- _is_runtime_protocol = False
- classmethod __init_subclass__(*args, **kwargs)
- classmethod __class_getitem__(params)
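`get_tokens` takes character indices rather than token indices. One plausible reading is sketched below with a hypothetical `CharToken` class and a free-standing `get_tokens` function. The overlap rule (a token is returned if any part of it falls inside the half-open range `[start_index, end_index)`) is an assumption, and the actual boundary handling in a medcat tokenizer may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharToken:
    """Hypothetical token that knows its starting character offset."""

    text: str
    char_index: int

    @property
    def end_char_index(self) -> int:
        return self.char_index + len(self.text)


def get_tokens(
    tokens: list[CharToken], start_index: int, end_index: int
) -> list[CharToken]:
    """Return tokens overlapping the character range [start_index, end_index)."""
    return [
        t
        for t in tokens
        if t.char_index < end_index and t.end_char_index > start_index
    ]


# Offsets for the text "chronic kidney disease".
tokens = [
    CharToken("chronic", 0),
    CharToken("kidney", 8),
    CharToken("disease", 15),
]
last_two = [t.text for t in get_tokens(tokens, 8, 22)]
partial = [t.text for t in get_tokens(tokens, 10, 12)]
```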
- exception medcat.tokenizing.tokens.UnregisteredDataPathException(cls, path)
Bases:
ValueError
Inappropriate argument value (of correct type).
- Parameters:
cls (Type)
path (str)
- __init__(cls, path)
Initialize self. See help(type(self)) for accurate signature.
- Parameters:
cls (Type)
path (str)
- cls
- path