medcat.preprocessors.cleaners
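Exceptions
UnknownTokenVersion | Inappropriate argument value (of correct type).
Classes
MutableDocument | The mutable parts of the document.
BaseTokenizer | The base tokenizer protocol.
NameDescriptor
LGeneral | Base class for protocol classes.
LPreprocessing | Base class for protocol classes.
LCDBMaker | Base class for protocol classes.
Functions
_get_tokens(config, sc_name, version)
_update_dict(configs, raw_name, names, tokens, is_upper)
prepare_name(raw_name, nlp, names, configs) | Generates different forms of a name. Will edit the provided names dictionary and add information generated from the name.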
Module Contents
- class medcat.preprocessors.cleaners.MutableDocument
Bases:
Protocol
The mutable parts of the document.
Represents parts of the document that can / should be changed by the various components.
- property base: BaseDocument
The base document.
- Return type:
BaseDocument
- property linked_ents: list[MutableEntity]
The linked entities associated with the document.
This should be set by the linker.
- Return type:
list[MutableEntity]
- property ner_ents: list[MutableEntity]
All entities recognised by NER.
This should be set by the NER component.
- Return type:
list[MutableEntity]
- __iter__()
- Return type:
Iterator[MutableToken]
- __getitem__(index: int) → MutableToken
- __getitem__(index: slice) → MutableEntity
- __len__()
- Return type:
int
- get_tokens(start_index, end_index)
Get the tokens that span the specified character indices.
- Parameters:
start_index (int) – The starting character index.
end_index (int) – The ending character index.
- Returns:
list[MutableToken] – The list of tokens.
- Return type:
list[MutableToken]
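The character-span lookup described for get_tokens can be pictured with a small stand-in. The sketch below does not use MedCAT's classes: ToyToken, toy_tokenize and tokens_in_span are hypothetical names, and it assumes a token qualifies when its character range overlaps the half-open range [start_index, end_index). The actual boundary handling depends on the implementing document class.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ToyToken:
    """Hypothetical stand-in for a token that knows its character offsets."""
    text: str
    start: int  # index of the first character
    end: int    # index one past the last character


def toy_tokenize(text: str) -> list[ToyToken]:
    """Split on whitespace while recording character offsets."""
    tokens, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        pos = start + len(word)
        tokens.append(ToyToken(word, start, pos))
    return tokens


def tokens_in_span(tokens: list[ToyToken], start_index: int,
                   end_index: int) -> list[ToyToken]:
    """Return tokens overlapping [start_index, end_index) (assumed semantics)."""
    return [t for t in tokens if t.start < end_index and t.end > start_index]


doc_tokens = toy_tokenize("chronic kidney disease stage 3")
selected = tokens_in_span(doc_tokens, 8, 22)
print([t.text for t in selected])  # ['kidney', 'disease']
```

Choosing overlap rather than strict containment means a span that cuts through the middle of a word still returns that word.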
- set_addon_data(path, val)
Used to add arbitrary data to the document.
This is generally used by addons to keep track of their data.
NB! The path used needs to be registered using the register_addon_path class method.
- Parameters:
path (str) – The data ID / path.
val (Any) – The value to be added.
- Return type:
None
- has_addon_data(path)
Checks whether the addon data for a specific path has been set.
- Parameters:
path (str) – The path to check.
- Returns:
bool – Whether the addon data has been set.
- Return type:
bool
- get_addon_data(path)
Get data added to the document.
See set_addon_data for details.
- Parameters:
path (str) – The data ID / path.
- Returns:
Any – The stored value.
- Return type:
Any
- get_available_addon_paths()
Gets the available addon data paths for this document.
This will only include paths that have values set.
- Returns:
list[str] – List of available addon data paths.
- Return type:
list[str]
- classmethod register_addon_path(path, def_val=None, force=True)
Register a custom/arbitrary data path.
This can be used to store arbitrary data along with the document for use in an addon (e.g. MetaCAT).
PS: If using this, it is important to use paths namespaced to the component you’re using in order to avoid conflicts.
- Parameters:
path (str) – The path to be used. Should be prefixed by the component name (e.g. meta_cat_id for an ID tied to the meta_cat addon).
def_val (Any) – Default value. Defaults to None.
force (bool) – Whether to forcefully add the value. Defaults to True.
- Return type:
None
- __slots__ = ()
- _is_protocol = True
- _is_runtime_protocol = False
- classmethod __init_subclass__(*args, **kwargs)
- classmethod __class_getitem__(params)
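The addon-data methods above follow a register-then-set pattern. The following is a minimal sketch of that pattern using a hypothetical ToyDocument class, not MedCAT's implementation. Several behaviours are assumptions made for illustration: that setting an unregistered path raises KeyError, that get_addon_data falls back to the registered default, and that force=False rejects re-registering a known path.

```python
from __future__ import annotations

from typing import Any


class ToyDocument:
    """Hypothetical stand-in; not MedCAT's document class."""

    # Registered paths and their default values, shared by all instances.
    _addon_defaults: dict[str, Any] = {}

    def __init__(self) -> None:
        self._addon_data: dict[str, Any] = {}

    @classmethod
    def register_addon_path(cls, path: str, def_val: Any = None,
                            force: bool = True) -> None:
        # Assumed meaning of `force`: allow re-registering a known path.
        if path in cls._addon_defaults and not force:
            raise ValueError(f"Path already registered: {path}")
        cls._addon_defaults[path] = def_val

    def set_addon_data(self, path: str, val: Any) -> None:
        # Assumption: unregistered paths are rejected.
        if path not in self._addon_defaults:
            raise KeyError(f"Unregistered addon path: {path}")
        self._addon_data[path] = val

    def has_addon_data(self, path: str) -> bool:
        return path in self._addon_data

    def get_addon_data(self, path: str) -> Any:
        # Assumption: fall back to the registered default when nothing was set.
        return self._addon_data.get(path, self._addon_defaults[path])

    def get_available_addon_paths(self) -> list[str]:
        # Only paths that have a value set on this instance.
        return list(self._addon_data)


# Namespaced path, as the documentation recommends.
ToyDocument.register_addon_path("meta_cat_id", def_val=-1)
doc = ToyDocument()
before = doc.has_addon_data("meta_cat_id")
doc.set_addon_data("meta_cat_id", 42)
print(before, doc.get_addon_data("meta_cat_id"), doc.get_available_addon_paths())
```

Registration is a class method because the set of valid paths belongs to the document type, while the stored values belong to each document instance.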
- class medcat.preprocessors.cleaners.BaseTokenizer
Bases:
Protocol
The base tokenizer protocol.
- create_entity(doc, token_start_index, token_end_index, label)
Create an entity from a document.
- Parameters:
doc (MutableDocument) – The document to use.
token_start_index (int) – The token start index.
token_end_index (int) – The token end index.
label (str) – The label.
- Returns:
MutableEntity – The resulting entity.
- Return type:
MutableEntity
- entity_from_tokens(tokens)
Get an entity from the list of tokens.
- Parameters:
tokens (list[MutableToken]) – List of tokens.
- Returns:
MutableEntity – The resulting entity.
- Return type:
MutableEntity
- __call__(text)
- Parameters:
text (str)
- Return type:
- classmethod create_new_tokenizer(config)
- Parameters:
config (medcat.config.Config)
- Return type:
typing_extensions.Self
- get_doc_class()
Get the document implementation class used by the tokenizer.
This can be used, e.g., to register addon paths.
- Returns:
Type[MutableDocument] – The document class.
- Return type:
Type[MutableDocument]
- get_entity_class()
Get the entity implementation class used by the tokenizer.
- Returns:
Type[MutableEntity] – The entity class.
- Return type:
Type[MutableEntity]
- __slots__ = ()
- _is_protocol = True
- _is_runtime_protocol = False
- classmethod __init_subclass__(*args, **kwargs)
- classmethod __class_getitem__(params)
- class medcat.preprocessors.cleaners.NameDescriptor
- tokens: list[str]
- snames: set[str]
- raw_name: str
- is_upper: bool
- class medcat.preprocessors.cleaners.LGeneral
Bases:
Protocol
Base class for protocol classes.
Protocol classes are defined as:
    class Proto(Protocol):
        def meth(self) -> int:
            ...
Such classes are primarily used with static type checkers that recognize structural subtyping (static duck-typing), for example:
    class C:
        def meth(self) -> int:
            return 0

    def func(x: Proto) -> int:
        return x.meth()

    func(C())  # Passes static type check
See PEP 544 for details. Protocol classes decorated with @typing.runtime_checkable act as simple-minded runtime protocols that check only the presence of given attributes, ignoring their type signatures. Protocol classes can be generic, they are defined as:
    class GenProto(Protocol[T]):
        def meth(self) -> T:
            ...
- separator: str
- __slots__ = ()
- _is_protocol = True
- _is_runtime_protocol = False
- classmethod __init_subclass__(*args, **kwargs)
- classmethod __class_getitem__(params)
- class medcat.preprocessors.cleaners.LPreprocessing
Bases:
Protocol
Base class for protocol classes.
Protocol classes are defined as:
    class Proto(Protocol):
        def meth(self) -> int:
            ...
Such classes are primarily used with static type checkers that recognize structural subtyping (static duck-typing), for example:
    class C:
        def meth(self) -> int:
            return 0

    def func(x: Proto) -> int:
        return x.meth()

    func(C())  # Passes static type check
See PEP 544 for details. Protocol classes decorated with @typing.runtime_checkable act as simple-minded runtime protocols that check only the presence of given attributes, ignoring their type signatures. Protocol classes can be generic, they are defined as:
    class GenProto(Protocol[T]):
        def meth(self) -> T:
            ...
- min_len_normalize: int
- do_not_normalize: set[str]
- __slots__ = ()
- _is_protocol = True
- _is_runtime_protocol = False
- classmethod __init_subclass__(*args, **kwargs)
- classmethod __class_getitem__(params)
- class medcat.preprocessors.cleaners.LCDBMaker
Bases:
Protocol
Base class for protocol classes.
Protocol classes are defined as:
    class Proto(Protocol):
        def meth(self) -> int:
            ...
Such classes are primarily used with static type checkers that recognize structural subtyping (static duck-typing), for example:
    class C:
        def meth(self) -> int:
            return 0

    def func(x: Proto) -> int:
        return x.meth()

    func(C())  # Passes static type check
See PEP 544 for details. Protocol classes decorated with @typing.runtime_checkable act as simple-minded runtime protocols that check only the presence of given attributes, ignoring their type signatures. Protocol classes can be generic, they are defined as:
    class GenProto(Protocol[T]):
        def meth(self) -> T:
            ...
- name_versions: list[str]
- min_letters_required: int
- __slots__ = ()
- _is_protocol = True
- _is_runtime_protocol = False
- classmethod __init_subclass__(*args, **kwargs)
- classmethod __class_getitem__(params)
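Because LGeneral, LPreprocessing and LCDBMaker are protocols that only declare attributes, any object carrying those attributes should satisfy them structurally. The sketch below re-declares look-alike protocols locally (GeneralLike, PreprocessingLike and CDBMakerLike are hypothetical names) so that it runs without MedCAT. The dataclasses and their values are illustrative placeholders, not MedCAT defaults, and @runtime_checkable is added here only to make the check visible at runtime; the listing above shows the real protocols with _is_runtime_protocol = False.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Protocol, runtime_checkable


@runtime_checkable
class GeneralLike(Protocol):
    separator: str


@runtime_checkable
class PreprocessingLike(Protocol):
    min_len_normalize: int
    do_not_normalize: set[str]


@runtime_checkable
class CDBMakerLike(Protocol):
    name_versions: list[str]
    min_letters_required: int


# Plain dataclasses: no inheritance from the protocols is needed.
@dataclass
class MyGeneral:
    separator: str = "~"  # placeholder value


@dataclass
class MyPreprocessing:
    min_len_normalize: int = 5  # placeholder value
    do_not_normalize: set[str] = field(default_factory=set)


@dataclass
class MyCDBMaker:
    name_versions: list[str] = field(default_factory=lambda: ["lower"])
    min_letters_required: int = 2  # placeholder value


# Same ordering as the `configs` tuple in the function signatures.
configs = (MyGeneral(), MyPreprocessing(), MyCDBMaker())
protocols = (GeneralLike, PreprocessingLike, CDBMakerLike)
checks = [isinstance(cfg, proto) for cfg, proto in zip(configs, protocols)]
print(checks)  # [True, True, True]
```

The runtime check only confirms that the attributes are present; a static type checker is what verifies their types.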
- medcat.preprocessors.cleaners._get_tokens(config, sc_name, version)
- Parameters:
config (LPreprocessing)
sc_name (medcat.tokenizing.tokens.MutableDocument)
version (str)
- Return type:
list[str]
- medcat.preprocessors.cleaners._update_dict(configs, raw_name, names, tokens, is_upper)
- Parameters:
configs (tuple[LGeneral, LPreprocessing, LCDBMaker])
raw_name (str)
names (dict[str, NameDescriptor])
tokens (list[str])
is_upper (bool)
- Return type:
None
- medcat.preprocessors.cleaners.prepare_name(raw_name, nlp, names, configs)
Generates different forms of a name. Will edit the provided names dictionary and add information generated from the name.
- Parameters:
raw_name (str) – The raw name to generate forms of.
nlp (BaseTokenizer) – The tokenizer.
names (dict[str, NameDescriptor]) – Dictionary of existing names for this concept in this row of a CSV. The newly generated name versions and other required information will be added here.
configs (tuple[LGeneral, LPreprocessing, LCDBMaker]) – Applicable configs for MedCAT.
- Returns:
names (dict) – The updated dictionary of prepared names.
- Return type:
dict[str, NameDescriptor]
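The way a names dictionary might be filled in place can be sketched with stand-ins. Everything below is hypothetical and is not MedCAT's code: ToyNameDescriptor mirrors only the four documented fields, toy_prepare_name uses whitespace splitting instead of a real tokenizer, and it produces a single lower-cased form. It also assumes that snames holds every leading sub-sequence of the tokens joined by the separator, which the reference above does not state.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ToyNameDescriptor:
    """Hypothetical stand-in with the four documented fields."""
    tokens: list[str]
    snames: set[str]
    raw_name: str
    is_upper: bool


def toy_prepare_name(raw_name: str, names: dict[str, ToyNameDescriptor],
                     separator: str = "~") -> dict[str, ToyNameDescriptor]:
    """Add one lower-cased form of raw_name to names (assumed behaviour)."""
    tokens = raw_name.strip().lower().split()
    if not tokens:
        return names
    name = separator.join(tokens)
    if name not in names:
        # Assumption: snames holds every leading sub-sequence of the tokens.
        snames = {separator.join(tokens[:i])
                  for i in range(1, len(tokens) + 1)}
        names[name] = ToyNameDescriptor(tokens, snames, raw_name,
                                        raw_name.isupper())
    return names


names: dict[str, ToyNameDescriptor] = {}
toy_prepare_name("Kidney Failure", names)
toy_prepare_name("CKD", names)
print(sorted(names))  # ['ckd', 'kidney~failure']
```

As with the documented function, the dictionary passed in is mutated and also returned, so callers can accumulate several raw names for one concept in a single dictionary.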
- exception medcat.preprocessors.cleaners.UnknownTokenVersion(version)
Bases:
ValueError
Inappropriate argument value (of correct type).
- Parameters:
version (str)
- __init__(version)
Initialize self. See help(type(self)) for accurate signature.
- Parameters:
version (str)
- Return type:
None
- class __cause__
exception cause
- class __context__
exception context
- __delattr__()
Implement delattr(self, name).
- __dir__()
Default dir() implementation.
- __eq__()
Return self==value.
- __format__()
Default object formatter.
- __ge__()
Return self>=value.
- __getattribute__()
Return getattr(self, name).
- __gt__()
Return self>value.
- __hash__()
Return hash(self).
- __le__()
Return self<=value.
- __lt__()
Return self<value.
- __ne__()
Return self!=value.
- __new__()
Create and return a new object. See help(type) for accurate signature.
- __reduce__()
- __reduce_ex__()
Helper for pickle.
- __repr__()
Return repr(self).
- __setattr__()
Implement setattr(self, name, value).
- __setstate__()
- __sizeof__()
Size of object in memory, in bytes.
- __str__()
Return str(self).
- __subclasshook__()
Abstract classes can override this to customize issubclass().
This is invoked early on by abc.ABCMeta.__subclasscheck__(). It should return True, False or NotImplemented. If it returns NotImplemented, the normal algorithm is used. Otherwise, it overrides the normal algorithm (and the outcome is cached).
- class __suppress_context__
- class __traceback__
- class args
- with_traceback()
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
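Since UnknownTokenVersion derives from ValueError, code that already handles ValueError should also catch it. The sketch below illustrates this with a hypothetical ToyUnknownTokenVersion; the message text, the stored version attribute and the pick_tokens helper are assumptions made for the example, not part of MedCAT.

```python
from __future__ import annotations


class ToyUnknownTokenVersion(ValueError):
    """Hypothetical stand-in; the message format is an assumption."""

    def __init__(self, version: str) -> None:
        super().__init__(f"Unknown token version: {version}")
        self.version = version


def pick_tokens(words: list[str], version: str) -> list[str]:
    """Return one form of the words, rejecting unsupported versions."""
    if version == "lower":
        return [w.lower() for w in words]
    raise ToyUnknownTokenVersion(version)


try:
    pick_tokens(["Kidney"], "shouty")
except ValueError as err:  # also catches the subclass
    caught = err

print(type(caught).__name__, caught.version)
```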