medcat.preprocessors.cleaners
=============================

.. py:module:: medcat.preprocessors.cleaners


Exceptions
----------

.. autoapisummary::

   medcat.preprocessors.cleaners.UnknownTokenVersion


Classes
-------

.. autoapisummary::

   medcat.preprocessors.cleaners.MutableDocument
   medcat.preprocessors.cleaners.BaseTokenizer
   medcat.preprocessors.cleaners.NameDescriptor
   medcat.preprocessors.cleaners.LGeneral
   medcat.preprocessors.cleaners.LPreprocessing
   medcat.preprocessors.cleaners.LCDBMaker


Functions
---------

.. autoapisummary::

   medcat.preprocessors.cleaners._get_tokens
   medcat.preprocessors.cleaners._update_dict
   medcat.preprocessors.cleaners.prepare_name


Module Contents
---------------

.. py:class:: MutableDocument

   Bases: :py:obj:`Protocol`

   The mutable parts of the document.

   Represents parts of the document that can / should be changed
   by the various components.

   .. py:property:: base
      :type: BaseDocument

      The base document.

   .. py:property:: linked_ents
      :type: list[MutableEntity]

      The linked entities associated with the document.

      This should be set by the linker.

   .. py:property:: ner_ents
      :type: list[MutableEntity]

      All entities recognised by NER.

      This should be set by the NER component.

   .. py:method:: __iter__()

   .. py:method:: __getitem__(index: int) -> MutableToken
                  __getitem__(index: slice) -> MutableEntity

   .. py:method:: __len__()

   .. py:method:: get_tokens(start_index, end_index)

      Get the tokens that span the specified character indices.

      :param start_index: The starting character index.
      :type start_index: int
      :param end_index: The ending character index.
      :type end_index: int

      :Returns: **list[MutableToken]** -- The list of tokens.

   .. py:method:: set_addon_data(path, val)

      Add arbitrary data to the document.

      This is generally used by addons to keep track of their data.

      NB! The path used needs to be registered using the
      `register_addon_path` class method.

      :param path: The data ID / path.
      :type path: str
      :param val: The value to be added.
      :type val: Any

   .. py:method:: has_addon_data(path)

      Checks whether the addon data for a specific path has been set.

      :param path: The path to check.
      :type path: str

      :Returns: **bool** -- Whether the addon data has been set.

   .. py:method:: get_addon_data(path)

      Get data added to the document.

      See `set_addon_data` for details.

      :param path: The data ID / path.
      :type path: str

      :Returns: **Any** -- The stored value.

   .. py:method:: get_available_addon_paths()

      Gets the available addon data paths for this document.

      This will only include paths that have values set.

      :Returns: **list[str]** -- List of available addon data paths.

   .. py:method:: register_addon_path(path, def_val = None, force = True)
      :classmethod:

      Register a custom/arbitrary data path.

      This can be used to store arbitrary data along with the document
      for use in an addon (e.g. MetaCAT).

      Note: when using this, namespace each path to the component that
      owns it in order to avoid conflicts.

      :param path: The path to be used. Should be prefixed by the component
                   name (e.g. `meta_cat_id` for an ID tied to the `meta_cat`
                   addon).
      :type path: str
      :param def_val: Default value. Defaults to `None`.
      :type def_val: Any
      :param force: Whether to forcefully add the value. Defaults to True.
      :type force: bool

   .. py:attribute:: __slots__
      :value: ()

   .. py:attribute:: _is_protocol
      :value: True

   .. py:attribute:: _is_runtime_protocol
      :value: False

   .. py:method:: __init_subclass__(*args, **kwargs)
      :classmethod:

   .. py:method:: __class_getitem__(params)
      :classmethod:


.. py:class:: BaseTokenizer

   Bases: :py:obj:`Protocol`

   The base tokenizer protocol.

   .. py:method:: create_entity(doc, token_start_index, token_end_index, label)

      Create an entity from a document.

      :param doc: The document to use.
      :type doc: MutableDocument
      :param token_start_index: The token start index.
      :type token_start_index: int
      :param token_end_index: The token end index.
      :type token_end_index: int
      :param label: The label.
      :type label: str

      :Returns: **MutableEntity** -- The resulting entity.

   .. py:method:: entity_from_tokens(tokens)

      Get an entity from the list of tokens.

      :param tokens: List of tokens.
      :type tokens: list[MutableToken]

      :Returns: **MutableEntity** -- The resulting entity.

   .. py:method:: __call__(text)

   .. py:method:: create_new_tokenizer(config)
      :classmethod:

   .. py:method:: get_doc_class()

      Get the document implementation class used by the tokenizer.

      This can be used (e.g.) to register addon paths.

      :Returns: **Type[MutableDocument]** -- The document class.

   .. py:method:: get_entity_class()

      Get the entity implementation class used by the tokenizer.

      :Returns: **Type[MutableEntity]** -- The entity class.

   .. py:attribute:: __slots__
      :value: ()

   .. py:attribute:: _is_protocol
      :value: True

   .. py:attribute:: _is_runtime_protocol
      :value: False

   .. py:method:: __init_subclass__(*args, **kwargs)
      :classmethod:

   .. py:method:: __class_getitem__(params)
      :classmethod:


.. py:class:: NameDescriptor

   .. py:attribute:: tokens
      :type: list[str]

   .. py:attribute:: snames
      :type: set[str]

   .. py:attribute:: raw_name
      :type: str

   .. py:attribute:: is_upper
      :type: bool


.. py:class:: LGeneral

   Bases: :py:obj:`Protocol`

   Base class for protocol classes.

   Protocol classes are defined as::

       class Proto(Protocol):
           def meth(self) -> int:
               ...

   Such classes are primarily used with static type checkers that recognize
   structural subtyping (static duck-typing), for example::

       class C:
           def meth(self) -> int:
               return 0

       def func(x: Proto) -> int:
           return x.meth()

       func(C())  # Passes static type check

   See PEP 544 for details. Protocol classes decorated with
   @typing.runtime_checkable act as simple-minded runtime protocols that check
   only the presence of given attributes, ignoring their type signatures.
   Protocol classes can be generic, they are defined as::

       class GenProto(Protocol[T]):
           def meth(self) -> T:
               ...

   .. py:attribute:: separator
      :type: str

   .. py:attribute:: __slots__
      :value: ()

   .. py:attribute:: _is_protocol
      :value: True

   .. py:attribute:: _is_runtime_protocol
      :value: False

   .. py:method:: __init_subclass__(*args, **kwargs)
      :classmethod:

   .. py:method:: __class_getitem__(params)
      :classmethod:


.. py:class:: LPreprocessing

   Bases: :py:obj:`Protocol`

   Base class for protocol classes.

   Protocol classes are defined as::

       class Proto(Protocol):
           def meth(self) -> int:
               ...

   Such classes are primarily used with static type checkers that recognize
   structural subtyping (static duck-typing), for example::

       class C:
           def meth(self) -> int:
               return 0

       def func(x: Proto) -> int:
           return x.meth()

       func(C())  # Passes static type check

   See PEP 544 for details. Protocol classes decorated with
   @typing.runtime_checkable act as simple-minded runtime protocols that check
   only the presence of given attributes, ignoring their type signatures.
   Protocol classes can be generic, they are defined as::

       class GenProto(Protocol[T]):
           def meth(self) -> T:
               ...

   .. py:attribute:: min_len_normalize
      :type: int

   .. py:attribute:: do_not_normalize
      :type: set[str]

   .. py:attribute:: __slots__
      :value: ()

   .. py:attribute:: _is_protocol
      :value: True

   .. py:attribute:: _is_runtime_protocol
      :value: False

   .. py:method:: __init_subclass__(*args, **kwargs)
      :classmethod:

   .. py:method:: __class_getitem__(params)
      :classmethod:


.. py:class:: LCDBMaker

   Bases: :py:obj:`Protocol`

   Base class for protocol classes.

   Protocol classes are defined as::

       class Proto(Protocol):
           def meth(self) -> int:
               ...

   Such classes are primarily used with static type checkers that recognize
   structural subtyping (static duck-typing), for example::

       class C:
           def meth(self) -> int:
               return 0

       def func(x: Proto) -> int:
           return x.meth()

       func(C())  # Passes static type check

   See PEP 544 for details. Protocol classes decorated with
   @typing.runtime_checkable act as simple-minded runtime protocols that check
   only the presence of given attributes, ignoring their type signatures.
   Protocol classes can be generic, they are defined as::

       class GenProto(Protocol[T]):
           def meth(self) -> T:
               ...

   .. py:attribute:: name_versions
      :type: list[str]

   .. py:attribute:: min_letters_required
      :type: int

   .. py:attribute:: __slots__
      :value: ()

   .. py:attribute:: _is_protocol
      :value: True

   .. py:attribute:: _is_runtime_protocol
      :value: False

   .. py:method:: __init_subclass__(*args, **kwargs)
      :classmethod:

   .. py:method:: __class_getitem__(params)
      :classmethod:


.. py:function:: _get_tokens(config, sc_name, version)

.. py:function:: _update_dict(configs, raw_name, names, tokens, is_upper)

.. py:function:: prepare_name(raw_name, nlp, names, configs)

   Generates different forms of a name.

   Edits the provided `names` dictionary in place, adding the information
   generated from `raw_name`.

   :param raw_name: The raw name to generate forms of.
   :type raw_name: str
   :param nlp: The tokenizer.
   :type nlp: BaseTokenizer
   :param names: Dictionary of existing names for this concept in this row
                 of a CSV. The new generated name versions and other required
                 information will be added here.
   :type names: dict[str, NameDescriptor]
   :param configs: Applicable configs for medcat.
   :type configs: tuple[LGeneral, LPreprocessing, LCDBMaker]

   :Returns: **names** (*dict*) -- The updated dictionary of prepared names.


.. py:exception:: UnknownTokenVersion(version)

   Bases: :py:obj:`ValueError`

   Inappropriate argument value (of correct type).

   .. py:method:: __init__(version)

      Initialize self. See help(type(self)) for accurate signature.

   .. py:class:: __cause__

      exception cause

   .. py:class:: __context__

      exception context

   .. py:method:: __delattr__()

      Implement delattr(self, name).

   .. py:method:: __dir__()

      Default dir() implementation.

   .. py:method:: __eq__()

      Return self==value.

   .. py:method:: __format__()

      Default object formatter.

   .. py:method:: __ge__()

      Return self>=value.

   .. py:method:: __getattribute__()

      Return getattr(self, name).

   .. py:method:: __gt__()

      Return self>value.

   .. py:method:: __hash__()

      Return hash(self).

   .. py:method:: __le__()

      Return self<=value.

   .. py:method:: __lt__()

      Return self
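
The ``configs`` argument of ``prepare_name`` is documented above as a ``tuple[LGeneral, LPreprocessing, LCDBMaker]``, and ``names`` as a ``dict[str, NameDescriptor]``. Because the three config types are protocols, any object exposing the listed attributes should satisfy them structurally. The following is a minimal sketch of what such inputs might look like. It does not import MedCAT: the protocols and ``NameDescriptor`` are local mirrors of the documented attributes, the ``*Cfg`` classes and ``build_configs`` are hypothetical, and every value (including the dictionary key format and the contents of ``snames``) is an illustrative assumption rather than a library default.

```python
from dataclasses import dataclass, field
from typing import Protocol


# Local mirrors of the documented protocols. Attribute names come from the
# reference above; the real classes live in medcat.preprocessors.cleaners.
class LGeneral(Protocol):
    separator: str


class LPreprocessing(Protocol):
    min_len_normalize: int
    do_not_normalize: set[str]


class LCDBMaker(Protocol):
    name_versions: list[str]
    min_letters_required: int


# Hypothetical plain config objects. They satisfy the protocols structurally,
# without inheriting from them. All values are illustrative assumptions.
@dataclass
class GeneralCfg:
    separator: str = "~"


@dataclass
class PreprocessingCfg:
    min_len_normalize: int = 5
    do_not_normalize: set[str] = field(default_factory=lambda: {"VBD", "VBG"})


@dataclass
class CDBMakerCfg:
    name_versions: list[str] = field(default_factory=lambda: ["LOWER", "CLEAN"])
    min_letters_required: int = 2


# Stand-in carrying the four documented NameDescriptor attributes.
@dataclass
class NameDescriptor:
    tokens: list[str]
    snames: set[str]
    raw_name: str
    is_upper: bool


def build_configs() -> tuple[LGeneral, LPreprocessing, LCDBMaker]:
    """Assemble the three-part config tuple in the documented order."""
    return GeneralCfg(), PreprocessingCfg(), CDBMakerCfg()


configs = build_configs()

# A hand-written entry showing the documented shape of `names`.
# The key format and the snames contents are assumptions for illustration.
names: dict[str, NameDescriptor] = {
    "kidney~failure": NameDescriptor(
        tokens=["kidney", "failure"],
        snames={"kidney", "kidney~failure"},
        raw_name="Kidney failure",
        is_upper=False,
    )
}
```

With the real library, the tuple and dictionary would be passed to ``prepare_name`` together with a tokenizer that implements ``BaseTokenizer``; the sketch stops short of that call because it depends on an installed MedCAT and a configured tokenizer.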