medcat.utils.regression.regression_checker

Attributes

`STRICTNESS_MATRIX`
`logger`
`DEFAULT_TEST_SUITE_PATH`
`parser`

Classes

`CAT`	This is a collection of serialisable model parts.
`RegressionSuite`	The regression checker.
`TranslationLayer`	The translation layer for translating:
`Strictness`	The total strictness on which to judge the results.
`Finding`	Describes whether or how the finding verified.

Functions

`show_description`()
`main`(model_pack_dir, test_suite_file[, phrases, ...])	Check test suite against the specified model pack.
`tuple3_parser`(arg)

Module Contents

class medcat.utils.regression.regression_checker.CAT(cdb, vocab=None, config=None, model_load_path=None)

Bases: medcat.storage.serialisables.AbstractSerialisable

This is a collection of serialisable model parts.

Parameters:

cdb (medcat.cdb.CDB)
vocab (Union[medcat.vocab.Vocab, None])
config (Optional[medcat.config.config.Config])
model_load_path (Optional[str])

__init__(cdb, vocab=None, config=None, model_load_path=None)

Parameters:

cdb (medcat.cdb.CDB)
vocab (Union[medcat.vocab.Vocab, None])
config (Optional[medcat.config.config.Config])
model_load_path (Optional[str])

Return type:

None

cdb

vocab = None

config = None

_trainer: medcat.trainer.Trainer | None = None

_pipeline

usage_monitor

_recreate_pipe(model_load_path=None)

Parameters:: model_load_path (Optional[str])
Return type:: medcat.pipeline.pipeline.Pipeline

classmethod get_init_attrs()

Return type:: list[str]

classmethod ignore_attrs()

Return type:: list[str]

__call__(text)

Parameters:: text (str)
Return type:: Optional[medcat.tokenizing.tokens.MutableDocument]

_ensure_not_training()

Method to ensure config is not set to train.

config.components.linking.train should only be True while training and not during inference. This also corrects the setting if necessary.

Return type:: None

get_entities(text: str, only_cui: Literal[False] = False) → medcat.data.entities.Entities

get_entities(text: str, only_cui: Literal[True] = True) → medcat.data.entities.OnlyCUIEntities

get_entities(text: str, only_cui: bool = False) → dict | medcat.data.entities.Entities | medcat.data.entities.OnlyCUIEntities

Get the entities recognised and linked within the provided text.

This will run the text through the pipeline and annotate the recognised and linked entities.

Parameters:

text (str) – The text to use.
only_cui (bool, optional) – Whether to only output the CUIs rather than the entire context. Defaults to False.

Returns:

Union[dict, Entities, OnlyCUIEntities] – The entities found and linked within the text.

_mp_worker_func(texts_and_indices)

Parameters:: texts_and_indices (list[tuple[str, str, bool]])
Return type:: list[tuple[str, str, Union[dict, medcat.data.entities.Entities, medcat.data.entities.OnlyCUIEntities]]]

_generate_batches_by_char_length(text_iter, batch_size_chars, only_cui)

Parameters:

text_iter (Union[Iterator[str], Iterator[tuple[str, str]]])
batch_size_chars (int)
only_cui (bool)

Return type:

Iterator[list[tuple[str, str, bool]]]

_generate_batches(text_iter, batch_size, batch_size_chars, only_cui)

Parameters:

text_iter (Union[Iterator[str], Iterator[tuple[str, str]]])
batch_size (int)
batch_size_chars (int)
only_cui (bool)

Return type:

Iterator[list[tuple[str, str, bool]]]

_generate_simple_batches(text_iter, batch_size, only_cui)

Parameters:

text_iter (Union[Iterator[str], Iterator[tuple[str, str]]])
batch_size (int)
only_cui (bool)

Return type:

Iterator[list[tuple[str, str, bool]]]

_mp_one_batch_per_process(executor, batch_iter, external_processes)

Parameters:

executor (concurrent.futures.ProcessPoolExecutor)
batch_iter (Iterator[list[tuple[str, str, bool]]])
external_processes (int)

Return type:

Iterator[tuple[str, Union[dict, medcat.data.entities.Entities, medcat.data.entities.OnlyCUIEntities]]]

get_entities_multi_texts(texts, only_cui=False, n_process=1, batch_size=-1, batch_size_chars=1000000)

Get entities from multiple texts (potentially in parallel).

If n_process > 1, n_process - 1 new processes will be created and data will be processed on those as well as the main process in parallel.

Parameters:

texts (Union[Iterable[str], Iterable[tuple[str, str]]]) – The input text. Either an iterable of raw text or one in the format of (text_index, text).
only_cui (bool) – Whether to only return CUIs rather than other information like start/end and annotated value. Defaults to False.
n_process (int) – Number of processes to use. Defaults to 1.
batch_size (int) – The number of texts to batch at a time. A batch of the specified size will be given to each worker process. Defaults to -1 and in this case the character count will be used instead.
batch_size_chars (int) – The maximum number of characters to process in a batch. Each process will be given a batch of texts with a total number of characters not exceeding this value. Defaults to 1,000,000 characters. Set to -1 to disable.

Yields:

Iterator[tuple[str, Union[dict, Entities, OnlyCUIEntities]]] – The results in the format of (text_index, entities).

Return type:

Iterator[tuple[str, Union[dict, medcat.data.entities.Entities, medcat.data.entities.OnlyCUIEntities]]]
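
The character-budget batching described for batch_size_chars can be sketched as a small generator. This is a standalone illustration, not MedCAT's implementation: `batch_by_chars` is a hypothetical helper, and giving an oversized text a batch of its own is an assumption of the sketch.

```python
from typing import Iterable, Iterator


def batch_by_chars(texts: Iterable[tuple[str, str]],
                   max_chars: int) -> Iterator[list[tuple[str, str]]]:
    """Group (text_index, text) pairs into batches of at most max_chars characters.

    Hypothetical helper. A single text longer than max_chars is emitted
    as a batch of its own (an assumption of this sketch).
    """
    batch: list[tuple[str, str]] = []
    batch_chars = 0
    for index, text in texts:
        # Close the current batch if adding this text would exceed the budget.
        if batch and batch_chars + len(text) > max_chars:
            yield batch
            batch, batch_chars = [], 0
        batch.append((index, text))
        batch_chars += len(text)
    if batch:
        # Emit whatever is left once the input is exhausted.
        yield batch


docs = [("0", "a" * 4), ("1", "b" * 4), ("2", "c" * 9), ("3", "d" * 2)]
batches = list(batch_by_chars(docs, max_chars=10))
```

With a budget of 10 characters, the first two texts share a batch and the remaining two each get their own, since adding either to its predecessor would exceed the budget.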

_get_entity(ent, doc_tokens, cui)

Parameters:

ent (medcat.tokenizing.tokens.MutableEntity)
doc_tokens (list[str])
cui (str)

Return type:

medcat.data.entities.Entity

get_addon_output(ent)

Get the addon output for the entity.

This includes a key-value pair for each addon that provides some. Sometimes same-type addons may combine their output under the same key.

Parameters:: ent (MutableEntity) – The entity in question.
Raises:: ValueError – If unable to merge multiple addon output.
Returns:: dict[str, dict] – All the addon output.
Return type:: dict[str, dict]

_doc_to_out_entity(ent, doc_tokens, only_cui)

Parameters:

ent (medcat.tokenizing.tokens.MutableEntity)
doc_tokens (list[str])
only_cui (bool)

Return type:

tuple[int, Union[medcat.data.entities.Entity, str]]

_doc_to_out(doc, only_cui, out_with_text=False)

Parameters:

doc (medcat.tokenizing.tokens.MutableDocument)
only_cui (bool)
out_with_text (bool)

Return type:

Union[medcat.data.entities.Entities, medcat.data.entities.OnlyCUIEntities]

property trainer: The trainer object.

save_model_pack(target_folder, pack_name=DEFAULT_PACK_NAME, serialiser_type='dill', make_archive=True, only_archive=False, add_hash_to_pack_name=True, change_description=None)

Save model pack.

The resulting model pack name will have the hash of the model pack in its name if (and only if) the default model pack name is used.

Parameters:

target_folder (str) – The folder to save the pack in.
pack_name (str, optional) – The model pack name. Defaults to DEFAULT_PACK_NAME.
serialiser_type (Union[str, AvailableSerialisers], optional) – The serialiser type. Defaults to ‘dill’.
make_archive (bool) – Whether to make the archive (.zip file). Defaults to True.
only_archive (bool) – Whether to clear the non-compressed folder. Defaults to False.
add_hash_to_pack_name (bool) – Whether to add the hash to the pack name. This is only relevant if pack_name is specified. Defaults to True.
change_description (Optional[str]) – If provided, this description will be added to the model description. Defaults to None.

Returns:

str – The final model pack path.

Return type:

str

_get_hash()

Return type:: str

_versioning(change_description)

Parameters:: change_description (Optional[str])
Return type:: str

classmethod attempt_unpack(zip_path)

Attempt to unpack the zip to a folder and get the model pack path.

If the folder already exists, no unpacking is done.

Parameters:: zip_path (str) – The ZIP path
Returns:: str – The model pack path
Return type:: str
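
The "unpack only if the folder is missing" behaviour can be sketched with the standard library. This is an illustration under assumptions, not the method's actual code: `unpack_if_needed` is a hypothetical helper, and deriving the target folder by dropping the `.zip` extension is an assumption.

```python
import os
import tempfile
import zipfile


def unpack_if_needed(zip_path: str) -> str:
    """Unpack zip_path next to itself unless the target folder already exists.

    Hypothetical helper; the folder name (the ZIP path without its
    extension) is an assumption of this sketch.
    """
    target_dir, _ext = os.path.splitext(zip_path)
    if not os.path.isdir(target_dir):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(target_dir)
    return target_dir


# Demo on a throwaway archive.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo_pack.zip")
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("config.json", "{}")
    first = unpack_if_needed(demo_zip)   # folder missing: unpacks
    second = unpack_if_needed(demo_zip)  # folder exists: no unpacking
    unpacked_ok = os.path.isfile(os.path.join(first, "config.json"))
    expected_dir = os.path.join(tmp, "demo_pack")
```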

classmethod load_model_pack(model_pack_path)

Load the model pack from file.

Parameters:: model_pack_path (str) – The model pack path.
Raises:: ValueError – If the saved data does not represent a model pack.
Returns:: CAT – The loaded model pack.
Return type:: CAT

classmethod load_cdb(model_pack_path)

Loads the concept database from the provided model pack path

Parameters:: model_pack_path (str) – path to model pack, zip or dir.
Returns:: CDB – The loaded concept database
Return type:: medcat.cdb.CDB

get_model_card(as_dict: Literal[True]) → medcat.data.model_card.ModelCard

get_model_card(as_dict: Literal[False]) → str

Get the model card either as a (nested) dict or as a JSON string.

Parameters:: as_dict (bool) – Whether to return as dict. Defaults to False.
Returns:: Union[str, ModelCard] – The model card.

__eq__(other)

Parameters:: other (Any)
Return type:: bool

add_addon(addon)

Parameters:: addon (medcat.components.addons.addons.AddonComponent)
Return type:: None

get_strategy()

Return type:: SerialisingStrategy

classmethod include_properties()

Return type:: list[str]

class medcat.utils.regression.regression_checker.RegressionSuite(cases, metadata, name)

The regression checker. This is used to check a bunch of regression cases at once against a model.

Parameters:

cases (list[RegressionCase]) – The list of regression cases
metadata (MetaData) – The metadata for the regression suite
use_report (bool) – Whether or not to use the report functionality. Defaults to False.
name (str)

__init__(cases, metadata, name)

Parameters:

cases (list[RegressionCase])
metadata (MetaData)
name (str)

Return type:

None

cases: list[RegressionCase]

report

metadata

get_all_distinct_cases(translation, edit_distance, use_diacritics)

Gets all the distinct cases for this regression suite.

While distinct cases can be determined without the translation layer, including it here simplifies the process.

Parameters:

translation (TranslationLayer) – The translation layer.
edit_distance (tuple[int, int, int]) – The edit distance(s) to try. Defaults to (0, 0, 0).
use_diacritics (bool) – Whether to use diacritics for edit distance.

Yields:

Iterator[tuple[RegressionCase, Iterator[FinalTarget]]] – The generator of the regression case along with its corresponding sub-cases.

Return type:

Iterator[tuple[RegressionCase, Iterator[medcat.utils.regression.targeting.FinalTarget]]]

estimate_total_distinct_cases()

Return type:: int

iter_subcases(translation, show_progress=True, edit_distance=(0, 0, 0), use_diacritics=False)

Iterate over all the sub-cases.

Each sub-case presents a unique target (phrase, concept, name) on the corresponding regression case.

Parameters:

translation (TranslationLayer) – The translation layer.
show_progress (bool) – Whether to show progress. Defaults to True.
edit_distance (tuple[int, int, int]) – The edit distance(s) to try. Defaults to (0, 0, 0).
use_diacritics (bool) – Whether to use diacritics for edit distance.

Yields:

Iterator[tuple[RegressionCase, FinalTarget]] – The generator of the regression case along with each of the final target sub-cases.

Return type:

Iterator[tuple[RegressionCase, medcat.utils.regression.targeting.FinalTarget]]

check_model(cat, translation, edit_distance=(0, 0, 0), use_diacritics=False)

Checks model and generates a report

Parameters:

cat (CAT) – The model to check against
translation (TranslationLayer) – The translation layer
edit_distance (tuple[int, int, int]) – The edit distance of the names. Defaults to (0, 0, 0).
use_diacritics (bool) – Whether to use diacritics for edit distance.

Returns:

MultiDescriptor – A report description

Return type:

medcat.utils.regression.results.MultiDescriptor
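
Putting the pieces documented on this page together, a regression run might look like the sketch below. `run_regression` is a hypothetical wrapper; the calls inside it use only names and signatures listed on this page, and the import path assumes these classes are importable from this module as documented here.

```python
def run_regression(model_pack_path: str, suite_yaml_path: str):
    """Load a model pack and a YAML suite, then check the model.

    Hypothetical wrapper; only the calls inside it come from this page.
    """
    # Imported lazily so the sketch can be defined without MedCAT installed.
    from medcat.utils.regression.regression_checker import (
        CAT, RegressionSuite, TranslationLayer)

    cat = CAT.load_model_pack(model_pack_path)
    suite = RegressionSuite.from_yaml(suite_yaml_path)
    # The translation layer is built from the model's concept database.
    translation = TranslationLayer.from_CDB(cat.cdb)
    # edit_distance and use_diacritics are left at their documented defaults.
    return suite.check_model(cat, translation)
```

The returned value is the MultiDescriptor report described above; how it is displayed or saved is not covered by this sketch.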

__str__()

Return type:: str

__repr__()

Return type:: str

to_dict()

Converts the RegressionChecker to dict for serialisation.

Returns:: dict – The dict representation
Return type:: dict

to_yaml()

Convert the RegressionChecker to YAML string.

Returns:: str – The YAML representation
Return type:: str

__eq__(other)

Parameters:: other (object)
Return type:: bool

classmethod from_dict(in_dict, name)

Construct a RegressionChecker from a dict.

Most of the parsing is handled in RegressionChecker.from_dict. This just assumes that each key in the dict is a name and each value describes a RegressionCase.

Parameters:

in_dict (dict) – The input dict.
name (str) – The name of the regression suite.

Returns:

RegressionChecker – The built regression checker

Return type:

RegressionSuite

classmethod from_yaml(file_name)

Constructs a RegressionChecker from a YAML file.

The from_dict method is used for the construction from the dict.

Parameters:: file_name (str) – The file name
Returns:: RegressionChecker – The constructed regression checker
Return type:: RegressionSuite

classmethod from_mct_export(file_name)

Parameters:: file_name (str)
Return type:: RegressionSuite

class medcat.utils.regression.regression_checker.TranslationLayer(cui2info, name2info, cui2children, separator, whitespace=' ')

The translation layer for translating:

- CUIs to names
- names to CUIs
- type_ids to CUIs
- CUIs to child CUIs

The idea is to decouple these translations from the CDB instance in case something changes there.

Parameters:

cui2info (dict[str, CUIInfo]) – The map from CUI to names
name2info (dict[str, NameInfo]) – The map from name to CUIs
cui2type_ids (dict[str, set[str]]) – The map from CUI to type_ids
cui2children (dict[str, set[str]]) – The map from CUI to child CUIs
separator (str)
whitespace (str)

__init__(cui2info, name2info, cui2children, separator, whitespace=' ')

Parameters:

cui2info (dict[str, medcat.cdb.concepts.CUIInfo])
name2info (dict[str, medcat.cdb.concepts.NameInfo])
cui2children (dict[str, set[str]])
separator (str)
whitespace (str)

Return type:

None

cui2info

name2info

separator

whitespace = ' '

type_id2cuis: dict[str, set[str]]

cui2children

get_names_of(cui, only_prefnames)

Get the preprocessed names of a CUI.

This method preprocesses the names by replacing the separator (generally ~) with the appropriate whitespace (` `).

If the concept is not in the underlying CDB, an empty list is returned.

Parameters:

cui (str) – The concept in question.
only_prefnames (bool) – Whether to only return a preferred name.

Returns:

list[str] – The list of names.

Return type:

list[str]
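
The separator replacement described for get_names_of amounts to a simple string substitution. A minimal sketch, assuming names are stored as a set of separator-joined strings; `preprocess_names` is a hypothetical helper, and the sorting is added here only to make the output deterministic.

```python
def preprocess_names(names: set[str], separator: str = "~",
                     whitespace: str = " ") -> list[str]:
    """Replace the storage separator with whitespace in each name.

    Hypothetical helper; sorting is only for a stable order in this sketch.
    """
    return sorted(name.replace(separator, whitespace) for name in names)


stored = {"kidney~failure", "renal~failure", "nephropathy"}
readable = preprocess_names(stored)
```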

get_preferred_name(cui)

Get the preferred name of a concept.

If no preferred name is found, the random ‘first’ name is selected.

Parameters:: cui (str) – The concept ID.
Returns:: str – The preferred name.
Return type:: str

get_first_name(cui)

Get the preprocessed (potentially) arbitrarily first name of the given concept.

If the concept does not exist, the CUI itself is returned.

PS: The “first” name may not be consistent across runs since it relies on set order.

Parameters:: cui (str) – The concept ID.
Returns:: str – The first name.
Return type:: str

get_direct_children(cui)

Get the direct children of a concept.

This means only the children, but not grandchildren.

If the underlying CDB doesn’t list children for this CUI, an empty list is returned.

Parameters:: cui (str) – The concept in question.
Returns:: list[str] – The (potentially empty) list of direct children.
Return type:: list[str]

get_direct_parents(cui)

Get the direct parent(s) of a concept.

PS: This method can be quite CPU heavy since it relies on running through all the parent-children relationships, as the child->parent(s) relationship isn’t normally kept track of.

Parameters:: cui (str) – The concept in question.
Returns:: list[str] – The (potentially empty) list of direct parents.
Return type:: list[str]
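
Because only the parent-to-children map is stored, finding parents means scanning every entry of that map, which is why the lookup is expensive. A minimal sketch of that scan; `direct_parents` is a hypothetical helper and the sorted output is an assumption made for determinism.

```python
def direct_parents(cui2children: dict[str, set[str]], cui: str) -> list[str]:
    """Return every CUI that lists `cui` among its direct children.

    Hypothetical helper: a linear scan over all parent -> children entries.
    """
    return sorted(parent for parent, children in cui2children.items()
                  if cui in children)


hierarchy = {"C1": {"C2", "C3"}, "C4": {"C3"}, "C3": {"C5"}}
parents_of_c3 = direct_parents(hierarchy, "C3")
```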

get_children_of(found_cuis, cui, depth=1)

Get the children of the specified CUI in the listed CUIs (if they exist).

Parameters:

found_cuis (Iterable[str]) – The list of CUIs to look in
cui (str) – The target parent CUI
depth (int) – The depth to carry out the search for

Returns:

list[str] – The list of children found

Return type:

list[str]
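
A depth-limited search of this kind can be sketched as a level-by-level walk of the children map. This is an illustration, not the method's actual code: `children_within` is a hypothetical helper that takes the children map as an explicit argument, and the result ordering is an assumption.

```python
from typing import Iterable


def children_within(found_cuis: Iterable[str], cui: str,
                    cui2children: dict[str, set[str]],
                    depth: int = 1) -> list[str]:
    """Return descendants of `cui` (down to `depth` levels) present in found_cuis.

    Hypothetical helper; depth=1 means direct children only.
    """
    found = set(found_cuis)
    matches: list[str] = []
    seen = {cui}
    frontier = [cui]
    for _ in range(depth):
        next_frontier: list[str] = []
        for parent in frontier:
            # sorted() only keeps the output order stable for this sketch.
            for child in sorted(cui2children.get(parent, ())):
                if child in seen:
                    continue
                seen.add(child)
                next_frontier.append(child)
                if child in found:
                    matches.append(child)
        frontier = next_frontier
    return matches


tree = {"A": {"B", "C"}, "B": {"D"}}
depth_one = children_within(["C", "D", "X"], "A", tree, depth=1)
depth_two = children_within(["C", "D", "X"], "A", tree, depth=2)
```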

classmethod from_CDB(cdb)

Construct a TranslationLayer object from a concept database (CDB).

This translation layer will refer to the same dicts that the CDB refers to. While there is no obvious reason these should be modified, it’s something to keep in mind.

Parameters:: cdb (CDB) – The CDB
Returns:: TranslationLayer – The subsequent TranslationLayer
Return type:: TranslationLayer

class medcat.utils.regression.regression_checker.Strictness

Bases: enum.Enum

The total strictness on which to judge the results.

STRICTEST: The strictest option which only allows identical findings.

STRICT: A strict option which allows identical or children.

NORMAL: Normal strictness also allows partial overlaps on target concept and children.

LENIENT: Lenient stictness also allows parents and grandparents.

ANYTHING

Anything strictness allows ANY finding.

This would generally only be relevant when disabling examples for results descriptors.

__new__(value)

_generate_next_value_(start, count, last_values)

Generate the next value when not given.

name: the name of the member
start: the initial start value or None
count: the number of existing members
last_value: the last value assigned or None

classmethod _missing_(value)

__repr__()

__str__()

__dir__(): Returns all members and all public methods

__format__(format_spec): Returns format using actual value type unless __str__ has been overridden.

__hash__()

__reduce_ex__(proto)

name(): The name of the Enum member.

value(): The value of the Enum member.

class medcat.utils.regression.regression_checker.Finding

Bases: enum.Enum

Describes whether or how the finding verified.

The idea is that we know where we expect the entity to be recognised and the enum constants describe how the recognition compared to the expectation.

In essence, we want to know the relative positions of the two pairs of numbers (character numbers):

- Expected Start, Expected End
- Recognised Start, Recognised End

We can model this as 4 numbers on the number line. And we want to know their position relative to each other. For example, if the expected positions are marked with * and recognised positions with #, we may have something like:

___*__#_______#*______________

which would indicate that a partial, but smaller, span was recognised.

IDENTICAL: The CUI and the span recognised are identical to what was expected.

BIGGER_SPAN_RIGHT

The CUI is the same, but the recognised span is longer on the right.

If we use the notation from the class doc string, e.g: _*#__*__#

BIGGER_SPAN_LEFT

The CUI is the same, but the recognised span is longer on the left.

If we use the notation from the class doc string, e.g: _#_*__*#_

BIGGER_SPAN_BOTH

The CUI is the same, but the recognised span is longer on both sides.

If we use the notation from the class doc string, e.g: _#__*__*__#_

SMALLER_SPAN

The CUI is the same, but the recognised span is smaller.

If we use the notation from the class doc string, e.g:

_*_#_#_*_ (neither start nor end match)
_*#_#_*__ (start matches, but end is before expected)
_*__#_#*_ (end matches, but start is after expected)

PARTIAL_OVERLAP

The CUI is the same, but the span overlaps partially.

If we use the notation from the class doc string, e.g:

_*_#__*_#_ (starts between expected start and end, but ends beyond)
_#_*_#_*__ (start before expected start, but ends between expected start and end)

FOUND_DIR_PARENT: The recognised CUI is a parent of the expected CUI but the span is an exact match.

FOUND_DIR_GRANDPARENT: The recognised CUI is a grandparent of the expected CUI but the span is an exact match.

FOUND_ANY_CHILD: The recognised CUI is a child of the expected CUI but the span is an exact match.

FOUND_CHILD_PARTIAL: The recognised CUI is a child yet the match is only partial (smaller/bigger/partial).

FOUND_OTHER: Found another CUI in the same span.

FAIL: The concept was not recognised in any meaningful way.

has_correct_cui()

Whether the finding found the correct concept.

Returns:: bool – Whether the correct concept was found.
Return type:: bool

classmethod determine(exp_cui, exp_start, exp_end, tl, found_entities, strict_only=False, check_children=True, check_parent=True, check_grandparent=True)

Determine the finding type based on the input

Parameters:

exp_cui (str) – Expected CUI.
exp_start (int) – Expected span start.
exp_end (int) – Expected span end.
tl (TranslationLayer) – The translation layer.
found_entities (dict[int, Entity]) – The entities found by the model.
strict_only (bool) – Whether to use a strict-only mode (either identical or fail). Defaults to False.
check_children (bool) – Whether to check the children. Defaults to True.
check_parent (bool) – Whether to check for parent(s). Defaults to True.
check_grandparent (bool) – Whether to check for grandparent(s). Defaults to True.

Returns:

tuple[‘Finding’, Optional[str]] – The type of finding determined, and the alternative.

Return type:

tuple[Finding, Optional[str]]

__new__(value)

_generate_next_value_(start, count, last_values)

Generate the next value when not given.

name: the name of the member
start: the initial start value or None
count: the number of existing members
last_value: the last value assigned or None

classmethod _missing_(value)

__repr__()

__str__()

__dir__(): Returns all members and all public methods

__format__(format_spec): Returns format using actual value type unless __str__ has been overridden.

__hash__()

__reduce_ex__(proto)

name(): The name of the Enum member.

value(): The value of the Enum member.

medcat.utils.regression.regression_checker.STRICTNESS_MATRIX: dict[Strictness, set[Finding]]
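
A mapping of this shape lets a finding be judged with a single set lookup. The sketch below builds such a mapping from the Strictness descriptions on this page. The actual contents of STRICTNESS_MATRIX are not listed here, so the assignment of findings to levels is an assumption, and `SketchFinding`, `SketchStrictness`, `SKETCH_MATRIX` and `passes` are hypothetical stand-ins rather than the module's objects.

```python
from enum import Enum, auto


class SketchFinding(Enum):
    """Stand-in for Finding (same member names, hypothetical values)."""
    IDENTICAL = auto()
    BIGGER_SPAN_RIGHT = auto()
    BIGGER_SPAN_LEFT = auto()
    BIGGER_SPAN_BOTH = auto()
    SMALLER_SPAN = auto()
    PARTIAL_OVERLAP = auto()
    FOUND_DIR_PARENT = auto()
    FOUND_DIR_GRANDPARENT = auto()
    FOUND_ANY_CHILD = auto()
    FOUND_CHILD_PARTIAL = auto()
    FOUND_OTHER = auto()
    FAIL = auto()


class SketchStrictness(Enum):
    """Stand-in for Strictness."""
    STRICTEST = auto()
    STRICT = auto()
    NORMAL = auto()
    LENIENT = auto()
    ANYTHING = auto()


F, S = SketchFinding, SketchStrictness

# Each level extends the previous one (assumed reading of the descriptions).
_strictest = {F.IDENTICAL}
_strict = _strictest | {F.FOUND_ANY_CHILD}
_normal = _strict | {F.BIGGER_SPAN_RIGHT, F.BIGGER_SPAN_LEFT,
                     F.BIGGER_SPAN_BOTH, F.SMALLER_SPAN,
                     F.PARTIAL_OVERLAP, F.FOUND_CHILD_PARTIAL}
_lenient = _normal | {F.FOUND_DIR_PARENT, F.FOUND_DIR_GRANDPARENT}

SKETCH_MATRIX: dict[SketchStrictness, set[SketchFinding]] = {
    S.STRICTEST: _strictest,
    S.STRICT: _strict,
    S.NORMAL: _normal,
    S.LENIENT: _lenient,
    S.ANYTHING: set(F),  # every finding, including FAIL
}


def passes(finding: SketchFinding, strictness: SketchStrictness) -> bool:
    """Whether a finding counts as acceptable at the given strictness."""
    return finding in SKETCH_MATRIX[strictness]
```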

medcat.utils.regression.regression_checker.logger

medcat.utils.regression.regression_checker.DEFAULT_TEST_SUITE_PATH

medcat.utils.regression.regression_checker.show_description()

medcat.utils.regression.regression_checker.main(model_pack_dir, test_suite_file, phrases=False, hide_empty=False, examples_strictness_str='STRICTEST', jsonpath=None, overwrite=False, jsonindent=None, strictness_str='NORMAL', max_phrase_length=80, use_mct_export=False, mct_export_yaml_path=None, only_mct_export_conversion=False, only_describe=False, require_fully_correct=False, edit_distance=(0, 0, 0))

Check test suite against the specified model pack.

Parameters:

model_pack_dir (Path) – The path to the model pack
test_suite_file (Path) – The path to the test suite YAML
phrases (bool) – Whether to show per-phrase information in a report
hide_empty (bool) – Whether to hide empty cases in a report
examples_strictness_str (str) – The example strictness string. Defaults to STRICTEST. NOTE: If you set this to ‘None’, examples will be omitted.
jsonpath (Optional[Path]) – The json path to save the report to (if specified)
overwrite (bool) – Whether to overwrite the file if it exists. Defaults to False
jsonindent (int) – The indentation for json objects. Defaults to 0
strictness_str (str) – The strictness name. Defaults to NORMAL.
max_phrase_length (int) – The maximum phrase length in examples. Defaults to 80.
use_mct_export (bool) – Whether to use a MedCATtrainer export as input. Defaults to False.
mct_export_yaml_path (str) – The (optional) path the converted MCT export should be saved as YAML at. If not set (or None), the MCT export is not saved in YAML format. Defaults to None.
only_mct_export_conversion (bool) – Whether to only deal with the MCT export conversion, i.e. exit when the MCT export conversion is done. Defaults to False.
only_describe (bool) – Whether to only describe the finding options and exit. Defaults to False.
require_fully_correct (bool) – Whether all cases are required to be correct. If set to True, an exit-status of 1 is returned unless all (sub)cases are correct. Defaults to False.
edit_distance (tuple[int, int, int]) – The edit distance, the random seed, and the number of edited names to pick for each of the names. If set to non-0, the specified number of splits, deletes, transposes, replaces, or inserts are done to each name. This can be useful for looking at the capability of identifying typos in text. However, this can make the process a lot slower as a result. Defaults to (0, 0, 0).

Raises:

ValueError – If unable to overwrite file or folder does not exist.

Return type:

None
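
The edit_distance parameter combines an edit distance, a random seed and a pick count. The sketch below shows one way such typo variants could be generated for a distance of 1, using the well-known split-based construction of deletes, transposes, replaces and inserts. It is an illustration only: `edits1` and `pick_edits` are hypothetical helpers, the lowercase ASCII alphabet is an assumption, and MedCAT's own generation may differ.

```python
import random
import string


def edits1(word: str, alphabet: str = string.ascii_lowercase) -> set[str]:
    """All strings one edit away from `word` (hypothetical helper)."""
    # Every way of cutting the word into a left and a right part.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    transposes = {left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1}
    replaces = {left + ch + right[1:]
                for left, right in splits if right for ch in alphabet}
    inserts = {left + ch + right for left, right in splits for ch in alphabet}
    return (deletes | transposes | replaces | inserts) - {word}


def pick_edits(word: str, seed: int, n_pick: int) -> list[str]:
    """Pick up to n_pick edited variants reproducibly (hypothetical helper)."""
    candidates = sorted(edits1(word))  # sorted so the seed fully fixes the result
    rng = random.Random(seed)
    return rng.sample(candidates, min(n_pick, len(candidates)))


variants = pick_edits("fever", seed=42, n_pick=3)
```

Seeding a dedicated `random.Random` instance keeps the selection reproducible across runs without touching the global random state.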

medcat.utils.regression.regression_checker.tuple3_parser(arg)

Parameters:: arg (str)
Return type:: tuple[int, int, int]
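
A parser with this signature is typically passed as the `type` of an argparse argument. The sketch below shows one plausible shape. The accepted input format (three comma-separated integers, optionally wrapped in parentheses) is an assumption, as are the names `parse_tuple3`, `demo_parser` and the `--edit-distance` flag; none of them are taken from the module.

```python
import argparse


def parse_tuple3(arg: str) -> tuple[int, int, int]:
    """Parse a string such as "1,42,5" into a tuple of three ints.

    Hypothetical helper; the input format is an assumption.
    """
    parts = arg.strip().strip("()").split(",")
    if len(parts) != 3:
        raise argparse.ArgumentTypeError(
            f"expected 3 comma-separated integers, got {arg!r}")
    try:
        first, second, third = (int(part) for part in parts)
    except ValueError as err:
        raise argparse.ArgumentTypeError(
            f"non-integer value in {arg!r}") from err
    return first, second, third


# Hypothetical parser and flag name, for demonstration only.
demo_parser = argparse.ArgumentParser()
demo_parser.add_argument("--edit-distance", type=parse_tuple3,
                         default=(0, 0, 0))
demo_args = demo_parser.parse_args(["--edit-distance", "1,42,5"])
```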

medcat.utils.regression.regression_checker.parser