medcat.tokenizing.spacy_impl.tokens
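Attributes
logger |
Exceptions
UnregisteredDataPathException | Inappropriate argument value (of correct type).
Classes
BaseToken | Base token protocol.
MutableToken | The mutable part of a token.
BaseEntity | Base entity protocol.
MutableEntity | The mutable part of an entity.
BaseDocument | The base document protocol.
Token |
Entity |
Document |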
Module Contents
- class medcat.tokenizing.spacy_impl.tokens.BaseToken
Bases:
Protocol
Base token protocol.
This represents the static (unchangeable) parts of a token.
- property text: str
The text represented by this token.
- Return type:
str
- property lower: str
The lower case text representation.
- Return type:
str
- property text_versions: list[str]
The different versions of the text (e.g. normalised and lower case).
- Return type:
list[str]
- property is_upper: bool
Whether the text is upper case.
- Return type:
bool
- property is_stop: bool
Whether the token represents a stop token.
- Return type:
bool
- property char_index: int
The character index of the start of this token.
- Return type:
int
- property index: int
The index (in terms of tokens) of this token in the document.
- Return type:
int
- property text_with_ws: str
The text with trailing whitespace (where applicable).
- Return type:
str
- property is_digit: bool
Whether the token represents a digit.
- Return type:
bool
- __slots__ = ()
- _is_protocol = True
- _is_runtime_protocol = False
- classmethod __init_subclass__(*args, **kwargs)
- classmethod __class_getitem__(params)
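Because BaseToken is a Protocol, a class can satisfy it structurally without inheriting from it. The following is a minimal sketch of that idea. TokenLike, SimpleToken and describe are hypothetical names invented for illustration, and TokenLike declares only three of the properties listed above; it is not the MedCAT class itself.

```python
from typing import Protocol


class TokenLike(Protocol):
    """Hypothetical, trimmed-down stand-in for BaseToken (three properties only)."""

    @property
    def text(self) -> str: ...

    @property
    def lower(self) -> str: ...

    @property
    def is_upper(self) -> bool: ...


class SimpleToken:
    """Toy token. It matches TokenLike by shape and never inherits from it."""

    def __init__(self, text: str) -> None:
        self._text = text

    @property
    def text(self) -> str:
        return self._text

    @property
    def lower(self) -> str:
        return self._text.lower()

    @property
    def is_upper(self) -> bool:
        return self._text.isupper()


def describe(token: TokenLike) -> str:
    # Relies only on the read-only properties declared by the protocol.
    return f"{token.text!r} lower={token.lower!r} upper={token.is_upper}"


tok = SimpleToken("MRI")
print(describe(tok))  # prints: 'MRI' lower='mri' upper=True
```

A static type checker accepts SimpleToken wherever TokenLike is expected, which is the same mechanism that lets different tokenizer backends sit behind one protocol.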
- class medcat.tokenizing.spacy_impl.tokens.MutableToken
Bases:
Protocol
The mutable part of a token.
This protocol describes all the parts of a token that could be expected to change.
- property is_punctuation: bool
Whether the token represents punctuation.
- Return type:
bool
- property to_skip: bool
Whether the token should be skipped.
- Return type:
bool
- property lemma: str
The lemmatised version of the text.
- Return type:
str
- property tag: str | None
Optional tag (e.g. for normalization).
- Return type:
Optional[str]
- property norm: str
The normalised text.
- Return type:
str
- __slots__ = ()
- _is_protocol = True
- _is_runtime_protocol = False
- classmethod __init_subclass__(*args, **kwargs)
- classmethod __class_getitem__(params)
- class medcat.tokenizing.spacy_impl.tokens.BaseEntity
Bases:
Protocol
Base entity protocol.
This describes the static (unchangeable) parts of an entity or sequence of tokens.
- property start_index: int
The index of the first token in the entity.
- Return type:
int
- property end_index: int
The index of the last token in the entity.
- Return type:
int
- property start_char_index: int
The character index of the first token.
- Return type:
int
- property end_char_index: int
The character index of the last token.
- Return type:
int
- property label: int
The label of the entity (NOTE: seems unused).
- Return type:
int
- property text: str
The text of the entire entity.
- Return type:
str
- __len__()
- Return type:
int
- __slots__ = ()
- _is_protocol = True
- _is_runtime_protocol = False
- classmethod __init_subclass__(*args, **kwargs)
- classmethod __class_getitem__(params)
- class medcat.tokenizing.spacy_impl.tokens.MutableEntity
Bases:
Protocol
The mutable part of an entity.
This represents the changeable part of an entity, that is, the parts that should be changed by the various components.
- property base: BaseEntity
The base / static entity part.
- Return type:
BaseEntity
- property detected_name: str
The detected name (if any) for this entity.
This should be set by the NER component.
- Return type:
str
- set_addon_data(path, val)
Used to add arbitrary data to the entity.
This is generally used by addons to keep track of their data.
NB! The path used needs to be registered using the register_addon_path class method.
- Parameters:
path (str) – The data ID / path.
val (Any) – The value to be added.
- Return type:
None
- has_addon_data(path)
Checks whether the addon data for a specific path has been set.
- Parameters:
path (str) – The path to check.
- Returns:
bool – Whether the addon data has been set.
- Return type:
bool
- get_addon_data(path)
Get data added to the entity.
See set_addon_data for details.
- Parameters:
path (str) – The data ID / path.
- Returns:
Any – The stored value.
- Return type:
Any
- get_available_addon_paths()
Gets the available addon data paths for this entity.
This will only include paths that have values set.
- Returns:
list[str] – List of available addon data paths.
- Return type:
list[str]
- property link_candidates: list[str]
The candidates for the detected name (if any) for this entity.
This should be set by the NER component.
- Return type:
list[str]
- property context_similarity: float
The context similarity of the linked entity.
This should be set by the linker component.
- Return type:
float
- property confidence: float
The confidence for the linked entity.
NOTE: This seems to be unused!
- Return type:
float
- property cui: str
The CUI of the linked entity.
This should be set by the linker component.
- Return type:
str
- property id: int
The ID of the entity within the document.
This counts all the entities recognised, not just ones that were successfully linked.
This should be set by the NER.
- Return type:
int
- classmethod register_addon_path(path, def_val=None, force=True)
Register a custom/arbitrary data path.
This can be used to store arbitrary data along with the entity for use in an addon (e.g. MetaCAT).
PS: If using this, it is important to use paths namespaced to the component you’re using in order to avoid conflicts.
- Parameters:
path (str) – The path to be used. Should be prefixed by the component name (e.g. meta_cat_id for an ID tied to the meta_cat addon).
def_val (Any) – Default value. Defaults to None.
force (bool) – Whether to forcefully add the value. Defaults to True.
- Return type:
None
- __iter__()
- Return type:
Iterator[MutableToken]
- __len__()
- Return type:
int
- __slots__ = ()
- _is_protocol = True
- _is_runtime_protocol = False
- classmethod __init_subclass__(*args, **kwargs)
- classmethod __class_getitem__(params)
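The register-then-set workflow described for the addon data methods can be sketched with a toy class. This is an illustration under stated assumptions, not the MedCAT implementation: ToyEntity and UnregisteredPathError are hypothetical stand-ins, and the exact effect of force and of the default value is assumed rather than taken from the library.

```python
from typing import Any


class UnregisteredPathError(ValueError):
    """Hypothetical stand-in for UnregisteredDataPathException."""


class ToyEntity:
    # Paths registered on the class, mapped to their default values.
    _registered: dict[str, Any] = {}

    def __init__(self) -> None:
        # Values actually set on this particular instance.
        self._data: dict[str, Any] = {}

    @classmethod
    def register_addon_path(cls, path: str, def_val: Any = None,
                            force: bool = True) -> None:
        # Assumption: without force, an already registered path is left as-is.
        if path in cls._registered and not force:
            return
        cls._registered[path] = def_val

    def set_addon_data(self, path: str, val: Any) -> None:
        # Setting data on a path that was never registered is an error.
        if path not in self._registered:
            raise UnregisteredPathError(path)
        self._data[path] = val

    def has_addon_data(self, path: str) -> bool:
        return path in self._data

    def get_addon_data(self, path: str) -> Any:
        if path not in self._registered:
            raise UnregisteredPathError(path)
        # Assumption: fall back to the registered default when nothing was set.
        return self._data.get(path, self._registered[path])

    def get_available_addon_paths(self) -> list[str]:
        # Only paths that have a value set on this instance.
        return list(self._data)


# Register once on the class, using a path namespaced to the addon.
ToyEntity.register_addon_path("meta_cat_id", def_val=-1)
ent = ToyEntity()
before = ent.has_addon_data("meta_cat_id")
ent.set_addon_data("meta_cat_id", 7)
```

Registering on the class while storing values per instance is what makes the namespacing advice matter: two addons that pick the same path would share one registration.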
- class medcat.tokenizing.spacy_impl.tokens.BaseDocument
Bases:
Protocol
The base document protocol.
Represents the unchangeable parts of the whole document.
- property text: str
The document raw text.
- Return type:
str
- __getitem__(index: int) → BaseToken
- __getitem__(index: slice) → BaseEntity
- isupper()
Whether the entire document is upper case.
- Return type:
bool
- __slots__ = ()
- _is_protocol = True
- _is_runtime_protocol = False
- classmethod __init_subclass__(*args, **kwargs)
- classmethod __class_getitem__(params)
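The two __getitem__ signatures above describe an overload: an integer index yields a single token, a slice yields a span. A minimal sketch of that pattern follows, assuming a hypothetical ToyDocument that splits on whitespace and returns plain strings rather than token or entity objects.

```python
from typing import Union, overload


class ToyDocument:
    """Hypothetical whitespace-tokenised document, not the MedCAT class."""

    def __init__(self, text: str) -> None:
        self.text = text
        self._tokens = text.split()

    @overload
    def __getitem__(self, index: int) -> str: ...

    @overload
    def __getitem__(self, index: slice) -> list[str]: ...

    def __getitem__(self, index: Union[int, slice]) -> Union[str, list[str]]:
        # List indexing already distinguishes an int from a slice.
        return self._tokens[index]

    def isupper(self) -> bool:
        # Whether the entire raw text is upper case.
        return self.text.isupper()


doc = ToyDocument("CT scan of the chest")
first = doc[0]    # a single token
span = doc[1:3]   # a run of tokens
```

The overload declarations exist only for the type checker; at runtime the single implementation handles both cases.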
- exception medcat.tokenizing.spacy_impl.tokens.UnregisteredDataPathException(cls, path)
Bases:
ValueError
Inappropriate argument value (of correct type).
- Parameters:
cls (Type)
path (str)
- __init__(cls, path)
Initialize self. See help(type(self)) for accurate signature.
- Parameters:
cls (Type)
path (str)
- cls
- path
- class __cause__
exception cause
- class __context__
exception context
- __delattr__()
Implement delattr(self, name).
- __dir__()
Default dir() implementation.
- __eq__()
Return self==value.
- __format__()
Default object formatter.
- __ge__()
Return self>=value.
- __getattribute__()
Return getattr(self, name).
- __gt__()
Return self>value.
- __hash__()
Return hash(self).
- __le__()
Return self<=value.
- __lt__()
Return self<value.
- __ne__()
Return self!=value.
- __new__()
Create and return a new object. See help(type) for accurate signature.
- __reduce__()
- __reduce_ex__()
Helper for pickle.
- __repr__()
Return repr(self).
- __setattr__()
Implement setattr(self, name, value).
- __setstate__()
- __sizeof__()
Size of object in memory, in bytes.
- __str__()
Return str(self).
- __subclasshook__()
Abstract classes can override this to customize issubclass().
This is invoked early on by abc.ABCMeta.__subclasscheck__(). It should return True, False or NotImplemented. If it returns NotImplemented, the normal algorithm is used. Otherwise, it overrides the normal algorithm (and the outcome is cached).
- class __suppress_context__
- class __traceback__
- class args
- with_traceback()
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
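An exception that takes cls and path and keeps both as attributes can be sketched as below. ToyUnregisteredPath is a hypothetical stand-in, and the message format is an assumption; only the constructor arguments, the two attributes and the ValueError base come from the listing above.

```python
class ToyUnregisteredPath(ValueError):
    """Hypothetical stand-in for UnregisteredDataPathException."""

    def __init__(self, cls: type, path: str) -> None:
        # The message wording is an assumption made for this sketch.
        super().__init__(f"Unregistered path {path!r} for class {cls.__name__}")
        self.cls = cls
        self.path = path


try:
    raise ToyUnregisteredPath(dict, "meta_cat_id")
except ValueError as err:
    # Catching ValueError also catches the subclass.
    caught = err
```

Keeping cls and path on the exception lets a caller report which class and which path were involved without parsing the message.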
- medcat.tokenizing.spacy_impl.tokens.logger
- class medcat.tokenizing.spacy_impl.tokens.Token(delegate)
- Parameters:
delegate (spacy.tokens.Token)
- __init__(delegate)
- Parameters:
delegate (spacy.tokens.Token)
- Return type:
None
- _delegate
- property is_punctuation: bool
- Return type:
bool
- property to_skip: bool
- Return type:
bool
- property norm: str
- Return type:
str
- property base: medcat.tokenizing.tokens.BaseToken
- Return type:
medcat.tokenizing.tokens.BaseToken
- property text: str
- Return type:
str
- property text_versions: list[str]
- Return type:
list[str]
- property lower: str
- Return type:
str
- property is_stop: bool
- Return type:
bool
- property is_digit: bool
- Return type:
bool
- property is_upper: bool
- Return type:
bool
- property tag: str | None
- Return type:
Optional[str]
- property lemma: str
- Return type:
str
- property text_with_ws: str
- Return type:
str
- property char_index: int
- Return type:
int
- property index: int
- Return type:
int
- __str__()
- __repr__()
- __hash__()
- Return type:
int
- __eq__(value)
- Return type:
bool
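Token takes a delegate and exposes its data through properties, which is a wrapper (delegation) pattern. The sketch below shows the general shape with hypothetical names: FakeBackendToken stands in for the wrapped spaCy token, with fields named after spaCy's text, i and idx attributes, and WrappedToken is a toy wrapper. How the real class stores mutable state or computes equality is assumed, not documented here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeBackendToken:
    """Hypothetical stand-in for the wrapped backend token."""
    text: str
    i: int    # position of the token in the document
    idx: int  # character offset of the token start


class WrappedToken:
    """Toy wrapper that forwards read-only properties to its delegate."""

    def __init__(self, delegate: FakeBackendToken) -> None:
        self._delegate = delegate
        # Assumption: mutable state lives on the wrapper in this sketch.
        self.to_skip = False

    @property
    def text(self) -> str:
        return self._delegate.text

    @property
    def index(self) -> int:
        return self._delegate.i

    @property
    def char_index(self) -> int:
        return self._delegate.idx

    def __eq__(self, value: object) -> bool:
        # Two wrappers are equal when they wrap equal delegates.
        return (isinstance(value, WrappedToken)
                and self._delegate == value._delegate)

    def __hash__(self) -> int:
        # Kept consistent with __eq__ by hashing the delegate.
        return hash(self._delegate)


a = WrappedToken(FakeBackendToken("fever", 2, 10))
b = WrappedToken(FakeBackendToken("fever", 2, 10))
```

Defining __hash__ alongside __eq__ keeps wrappers usable as set members and dictionary keys, which a class that defines only __eq__ would lose.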
- class medcat.tokenizing.spacy_impl.tokens.Entity(delegate)
- Parameters:
delegate (spacy.tokens.Span)
- _addon_extension_paths: set[str]
- __init__(delegate)
- Parameters:
delegate (spacy.tokens.Span)
- Return type:
None
- _delegate
- link_candidates: list[str] = []
- context_similarity: float = 0.0
- confidence: float = 0.0
- cui = ''
- id = -1
- detected_name = ''
- property base: medcat.tokenizing.tokens.BaseEntity
- Return type:
medcat.tokenizing.tokens.BaseEntity
- set_addon_data(path, val)
- Parameters:
path (str)
val (Any)
- Return type:
None
- has_addon_data(path)
- Parameters:
path (str)
- Return type:
bool
- get_addon_data(path)
- Parameters:
path (str)
- Return type:
Any
- get_available_addon_paths()
- Return type:
list[str]
- classmethod register_addon_path(path, def_val=None, force=True)
- Parameters:
path (str)
def_val (Any)
force (bool)
- Return type:
None
- property text: str
- Return type:
str
- property label: int
- Return type:
int
- property start_index: int
- Return type:
int
- property end_index: int
- Return type:
int
- property start_char_index: int
- Return type:
int
- property end_char_index: int
- Return type:
int
- __iter__()
- Return type:
Iterator[medcat.tokenizing.tokens.MutableToken]
- __len__()
- Return type:
int
- __str__()
- __repr__()
- class medcat.tokenizing.spacy_impl.tokens.Document(delegate)
- Parameters:
delegate (spacy.tokens.Doc)
- _addon_extension_paths: set[str]
- __init__(delegate)
- Parameters:
delegate (spacy.tokens.Doc)
- Return type:
None
- _delegate
- ner_ents: list[medcat.tokenizing.tokens.MutableEntity] = []
- linked_ents: list[medcat.tokenizing.tokens.MutableEntity] = []
- property base: medcat.tokenizing.tokens.BaseDocument
- Return type:
medcat.tokenizing.tokens.BaseDocument
- property text: str
- Return type:
str
- __getitem__(index: int) → medcat.tokenizing.tokens.MutableToken
- __getitem__(index: slice) → medcat.tokenizing.tokens.MutableEntity
- __len__()
- Return type:
int
- get_tokens(start_index, end_index)
- Parameters:
start_index (int)
end_index (int)
- Return type:
- set_addon_data(path, val)
- Parameters:
path (str)
val (Any)
- Return type:
None
- has_addon_data(path)
- Parameters:
path (str)
- Return type:
bool
- get_addon_data(path)
- Parameters:
path (str)
- Return type:
Any
- get_available_addon_paths()
- Return type:
list[str]
- classmethod register_addon_path(path, def_val=None, force=True)
- Parameters:
path (str)
def_val (Any)
force (bool)
- Return type:
None
- __iter__()
- Return type:
Iterator[medcat.tokenizing.tokens.MutableToken]
- isupper()
- Return type:
bool
- __str__()
- __repr__()