medcat.utils.regression.regression_checker

Attributes

STRICTNESS_MATRIX

logger

DEFAULT_TEST_SUITE_PATH

parser

Classes

CAT

This is a collection of serialisable model parts.

RegressionSuite

The regression checker.

TranslationLayer

The translation layer for translating:

Strictness

The total strictness on which to judge the results.

Finding

Describes whether or how the finding was verified.

Functions

show_description()

main(model_pack_dir, test_suite_file[, phrases, ...])

Check test suite against the specified model pack.

tuple3_parser(arg)

Module Contents

class medcat.utils.regression.regression_checker.CAT(cdb, vocab=None, config=None, model_load_path=None)

Bases: medcat.storage.serialisables.AbstractSerialisable

This is a collection of serialisable model parts.

Parameters:
__init__(cdb, vocab=None, config=None, model_load_path=None)
Parameters:
Return type:

None

cdb
vocab = None
config = None
_trainer: medcat.trainer.Trainer | None = None
_pipeline
usage_monitor
_recreate_pipe(model_load_path=None)
Parameters:

model_load_path (Optional[str])

Return type:

medcat.pipeline.pipeline.Pipeline

classmethod get_init_attrs()
Return type:

list[str]

classmethod ignore_attrs()
Return type:

list[str]

__call__(text)
Parameters:

text (str)

Return type:

Optional[medcat.tokenizing.tokens.MutableDocument]

_ensure_not_training()

Method to ensure config is not set to train.

config.components.linking.train should only be True while training and not during inference. This also corrects the setting if necessary.

Return type:

None

get_entities(text: str, only_cui: Literal[False] = False) medcat.data.entities.Entities
get_entities(text: str, only_cui: Literal[True] = True) medcat.data.entities.OnlyCUIEntities
get_entities(text: str, only_cui: bool = False) dict | medcat.data.entities.Entities | medcat.data.entities.OnlyCUIEntities

Get the entities recognised and linked within the provided text.

This will run the text through the pipeline and annotate the recognised and linked entities.

Parameters:
  • text (str) – The text to use.

  • only_cui (bool, optional) – Whether to only output the CUIs rather than the entire context. Defaults to False.

Returns:

Union[dict, Entities, OnlyCUIEntities] – The entities found and linked within the text.

_mp_worker_func(texts_and_indices)
Parameters:

texts_and_indices (list[tuple[str, str, bool]])

Return type:

list[tuple[str, str, Union[dict, medcat.data.entities.Entities, medcat.data.entities.OnlyCUIEntities]]]

_generate_batches_by_char_length(text_iter, batch_size_chars, only_cui)
Parameters:
  • text_iter (Union[Iterator[str], Iterator[tuple[str, str]]])

  • batch_size_chars (int)

  • only_cui (bool)

Return type:

Iterator[list[tuple[str, str, bool]]]

_generate_batches(text_iter, batch_size, batch_size_chars, only_cui)
Parameters:
  • text_iter (Union[Iterator[str], Iterator[tuple[str, str]]])

  • batch_size (int)

  • batch_size_chars (int)

  • only_cui (bool)

Return type:

Iterator[list[tuple[str, str, bool]]]

_generate_simple_batches(text_iter, batch_size, only_cui)
Parameters:
  • text_iter (Union[Iterator[str], Iterator[tuple[str, str]]])

  • batch_size (int)

  • only_cui (bool)

Return type:

Iterator[list[tuple[str, str, bool]]]

_mp_one_batch_per_process(executor, batch_iter, external_processes)
Parameters:
  • executor (concurrent.futures.ProcessPoolExecutor)

  • batch_iter (Iterator[list[tuple[str, str, bool]]])

  • external_processes (int)

Return type:

Iterator[tuple[str, Union[dict, medcat.data.entities.Entities, medcat.data.entities.OnlyCUIEntities]]]

get_entities_multi_texts(texts, only_cui=False, n_process=1, batch_size=-1, batch_size_chars=1000000)

Get entities from multiple texts (potentially in parallel).

If n_process > 1, n_process - 1 new processes will be created and data will be processed on those as well as the main process in parallel.

Parameters:
  • texts (Union[Iterable[str], Iterable[tuple[str, str]]]) – The input text. Either an iterable of raw text or one in the format of (text_index, text).

  • only_cui (bool) – Whether to only return CUIs rather than other information like start/end and annotated value. Defaults to False.

  • n_process (int) – Number of processes to use. Defaults to 1.

  • batch_size (int) – The number of texts to batch at a time. A batch of the specified size will be given to each worker process. Defaults to -1 and in this case the character count will be used instead.

  • batch_size_chars (int) – The maximum number of characters to process in a batch. Each process will be given a batch of texts with a total number of characters not exceeding this value. Defaults to 1,000,000 characters. Set to -1 to disable.

Yields:

Iterator[tuple[str, Union[dict, Entities, OnlyCUIEntities]]] – The results in the format of (text_index, entities).

Return type:

Iterator[tuple[str, Union[dict, medcat.data.entities.Entities, medcat.data.entities.OnlyCUIEntities]]]
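
The character-based batching described for batch_size_chars can be illustrated with a minimal, self-contained sketch. It is not MedCAT's implementation: the helper name batch_by_char_length and the choice to give an oversized text a batch of its own are assumptions made for illustration.

```python
from typing import Iterable, Iterator


def batch_by_char_length(texts: Iterable[str], max_chars: int) -> Iterator[list[str]]:
    """Group texts into batches whose total character count stays within max_chars.

    Assumption: a single text longer than max_chars still gets a batch of its own.
    """
    batch: list[str] = []
    batch_chars = 0
    for text in texts:
        # Close the current batch if adding this text would exceed the limit.
        if batch and batch_chars + len(text) > max_chars:
            yield batch
            batch, batch_chars = [], 0
        batch.append(text)
        batch_chars += len(text)
    if batch:
        # Emit whatever is left over once the input is exhausted.
        yield batch


batches = list(batch_by_char_length(["aaaa", "bb", "cccccc", "d"], max_chars=6))
print(batches)
```

Batching by characters rather than by number of texts keeps the work per process roughly even when document lengths vary widely.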

_get_entity(ent, doc_tokens, cui)
Parameters:
Return type:

medcat.data.entities.Entity

get_addon_output(ent)

Get the addon output for the entity.

This includes a key-value pair for each addon that provides some. Sometimes same-type addons may combine their output under the same key.

Parameters:

ent (MutableEntity) – The entity in question.

Raises:

ValueError – If unable to merge multiple addon output.

Returns:

dict[str, dict] – All the addon output.

Return type:

dict[str, dict]

_doc_to_out_entity(ent, doc_tokens, only_cui)
Parameters:
Return type:

tuple[int, Union[medcat.data.entities.Entity, str]]

_doc_to_out(doc, only_cui, out_with_text=False)
Parameters:
Return type:

Union[medcat.data.entities.Entities, medcat.data.entities.OnlyCUIEntities]

property trainer

The trainer object.

save_model_pack(target_folder, pack_name=DEFAULT_PACK_NAME, serialiser_type='dill', make_archive=True, only_archive=False, add_hash_to_pack_name=True, change_description=None)

Save model pack.

The resulting model pack name will have the hash of the model pack in its name if (and only if) the default model pack name is used.

Parameters:
  • target_folder (str) – The folder to save the pack in.

  • pack_name (str, optional) – The model pack name. Defaults to DEFAULT_PACK_NAME.

  • serialiser_type (Union[str, AvailableSerialisers], optional) – The serialiser type. Defaults to ‘dill’.

  • make_archive (bool) – Whether to make the archive (.zip file). Defaults to True.

  • only_archive (bool) – Whether to clear the non-compressed folder. Defaults to False.

  • add_hash_to_pack_name (bool) – Whether to add the hash to the pack name. This is only relevant if pack_name is specified. Defaults to True.

  • change_description (Optional[str]) – If provided, this description will be added to the model description. Defaults to None.

Returns:

str – The final model pack path.

Return type:

str

_get_hash()
Return type:

str

_versioning(change_description)
Parameters:

change_description (Optional[str])

Return type:

str

classmethod attempt_unpack(zip_path)

Attempt to unpack the zip to a folder and get the model pack path.

If the folder already exists, no unpacking is done.

Parameters:

zip_path (str) – The ZIP path

Returns:

str – The model pack path

Return type:

str
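
The "unpack only if the folder does not exist yet" behaviour can be sketched with the standard library. This is an illustration only: the helper name unpack_once and the rule for deriving the folder name (dropping the .zip suffix) are assumptions, not details confirmed by this page.

```python
import os
import tempfile
import zipfile


def unpack_once(zip_path: str) -> str:
    """Unpack zip_path next to itself unless the target folder already exists."""
    # Assumption: the target folder is the ZIP path without its ".zip" suffix.
    if zip_path.endswith(".zip"):
        target = zip_path[: -len(".zip")]
    else:
        target = zip_path + "_unpacked"
    if not os.path.exists(target):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(target)
    return target


with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "pack.zip")
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("config.json", "{}")
    first = unpack_once(zip_path)
    second = unpack_once(zip_path)  # folder exists, so nothing is extracted again
    unpacked_files = sorted(os.listdir(first))
```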

classmethod load_model_pack(model_pack_path)

Load the model pack from file.

Parameters:

model_pack_path (str) – The model pack path.

Raises:

ValueError – If the saved data does not represent a model pack.

Returns:

CAT – The loaded model pack.

Return type:

CAT

classmethod load_cdb(model_pack_path)

Loads the concept database from the provided model pack path

Parameters:

model_pack_path (str) – path to model pack, zip or dir.

Returns:

CDB – The loaded concept database

Return type:

medcat.cdb.CDB

get_model_card(as_dict: Literal[True]) medcat.data.model_card.ModelCard
get_model_card(as_dict: Literal[False]) str

Get the model card either as a (nested) dict or as a JSON string.

Parameters:

as_dict (bool) – Whether to return as dict. Defaults to False.

Returns:

Union[str, ModelCard] – The model card.

__eq__(other)
Parameters:

other (Any)

Return type:

bool

add_addon(addon)
Parameters:

addon (medcat.components.addons.addons.AddonComponent)

Return type:

None

get_strategy()
Return type:

SerialisingStrategy

classmethod include_properties()
Return type:

list[str]

class medcat.utils.regression.regression_checker.RegressionSuite(cases, metadata, name)

The regression checker. This is used to check a bunch of regression cases at once against a model.

Parameters:
  • cases (list[RegressionCase]) – The list of regression cases

  • metadata (MetaData) – The metadata for the regression suite

  • use_report (bool) – Whether or not to use the report functionality. Defaults to False.

  • name (str)

__init__(cases, metadata, name)
Parameters:
Return type:

None

cases: list[RegressionCase]
report
metadata
get_all_distinct_cases(translation, edit_distance, use_diacritics)

Gets all the distinct cases for this regression suite.

While distinct cases can be determined without the translation layer, including it here simplifies the process.

Parameters:
  • translation (TranslationLayer) – The translation layer.

  • edit_distance (tuple[int, int, int]) – The edit distance(s) to try. Defaults to (0, 0, 0).

  • use_diacritics (bool) – Whether to use diacritics for edit distance.

Yields:

Iterator[tuple[RegressionCase, Iterator[FinalTarget]]] – The generator of the regression case along with its corresponding sub-cases.

Return type:

Iterator[tuple[RegressionCase, Iterator[medcat.utils.regression.targeting.FinalTarget]]]

estimate_total_distinct_cases()
Return type:

int

iter_subcases(translation, show_progress=True, edit_distance=(0, 0, 0), use_diacritics=False)

Iterate over all the sub-cases.

Each sub-case presents a unique target (phrase, concept, name) on the corresponding regression case.

Parameters:
  • translation (TranslationLayer) – The translation layer.

  • show_progress (bool) – Whether to show progress. Defaults to True.

  • edit_distance (tuple[int, int, int]) – The edit distance(s) to try. Defaults to (0, 0, 0).

  • use_diacritics (bool) – Whether to use diacritics for edit distance.

Yields:

Iterator[tuple[RegressionCase, FinalTarget]] – The generator of the regression case along with each of the final target sub-cases.

Return type:

Iterator[tuple[RegressionCase, medcat.utils.regression.targeting.FinalTarget]]

check_model(cat, translation, edit_distance=(0, 0, 0), use_diacritics=False)

Checks model and generates a report

Parameters:
  • cat (CAT) – The model to check against

  • translation (TranslationLayer) – The translation layer

  • edit_distance (tuple[int, int, int]) – The edit distance of the names. Defaults to (0, 0, 0).

  • use_diacritics (bool) – Whether to use diacritics for edit distance.

Returns:

MultiDescriptor – A report description

Return type:

medcat.utils.regression.results.MultiDescriptor
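
A possible way to wire the documented pieces together is sketched below. It uses only calls listed on this page (CAT.load_model_pack, RegressionSuite.from_yaml, TranslationLayer.from_CDB, check_model); the helper name run_regression and both path arguments are hypothetical, and the import is done lazily so that merely defining the helper does not require MedCAT to be installed.

```python
def run_regression(model_pack_path: str, suite_yaml_path: str):
    """Load a model pack and a YAML test suite, then return the report descriptor."""
    # Lazy import: MedCAT is only needed when the helper is actually called.
    from medcat.utils.regression.regression_checker import (
        CAT,
        RegressionSuite,
        TranslationLayer,
    )

    cat = CAT.load_model_pack(model_pack_path)
    suite = RegressionSuite.from_yaml(suite_yaml_path)
    # The translation layer is built from the model's concept database.
    translation = TranslationLayer.from_CDB(cat.cdb)
    return suite.check_model(
        cat, translation, edit_distance=(0, 0, 0), use_diacritics=False
    )
```

Calling run_regression with a real model pack path and suite file would be expected to return the MultiDescriptor documented above.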

__str__()
Return type:

str

__repr__()
Return type:

str

to_dict()

Converts the RegressionChecker to dict for serialisation.

Returns:

dict – The dict representation

Return type:

dict

to_yaml()

Convert the RegressionChecker to YAML string.

Returns:

str – The YAML representation

Return type:

str

__eq__(other)
Parameters:

other (object)

Return type:

bool

classmethod from_dict(in_dict, name)

Construct a RegressionChecker from a dict.

Most of the parsing is handled in RegressionChecker.from_dict. This just assumes that each key in the dict is a name and each value describes a RegressionCase.

Parameters:
  • in_dict (dict) – The input dict.

  • name (str) – The name of the regression suite.

Returns:

RegressionChecker – The built regression checker

Return type:

RegressionSuite

classmethod from_yaml(file_name)

Constructs a RegressionChecker from a YAML file.

The from_dict method is used for the construction from the dict.

Parameters:

file_name (str) – The file name

Returns:

RegressionChecker – The constructed regression checker

Return type:

RegressionSuite

classmethod from_mct_export(file_name)
Parameters:

file_name (str)

Return type:

RegressionSuite

class medcat.utils.regression.regression_checker.TranslationLayer(cui2info, name2info, cui2children, separator, whitespace=' ')

The translation layer for translating:
  • CUIs to names
  • names to CUIs
  • type_ids to CUIs
  • CUIs to child CUIs

The idea is to decouple these translations from the CDB instance in case something changes there.

Parameters:
  • cui2info (dict[str, CUIInfo]) – The map from CUI to names

  • name2info (dict[str, NameInfo]) – The map from name to CUIs

  • cui2type_ids (dict[str, set[str]]) – The map from CUI to type_ids

  • cui2children (dict[str, set[str]]) – The map from CUI to child CUIs

  • separator (str)

  • whitespace (str)

__init__(cui2info, name2info, cui2children, separator, whitespace=' ')
Parameters:
Return type:

None

cui2info
name2info
separator
whitespace = ' '
type_id2cuis: dict[str, set[str]]
cui2children
get_names_of(cui, only_prefnames)

Get the preprocessed names of a CUI.

This method preprocesses the names by replacing the separator (generally ~) with the appropriate whitespace (' ').

If the concept is not in the underlying CDB, an empty list is returned.

Parameters:
  • cui (str) – The concept in question.

  • only_prefnames (bool) – Whether to only return a preferred name.

Returns:

list[str] – The list of names.

Return type:

list[str]
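
The preprocessing step amounts to a string replacement per name. The sketch below assumes a plain dict from CUI to a set of raw names (the real cui2info values are CUIInfo objects) and sorts the result only to make the output deterministic; the function name names_of and the example CUI are hypothetical.

```python
def names_of(
    cui2names: dict[str, set[str]],
    cui: str,
    separator: str = "~",
    whitespace: str = " ",
) -> list[str]:
    """Return the names of a CUI with the separator replaced by whitespace."""
    # Unknown concepts yield an empty list, as described above.
    raw_names = cui2names.get(cui, set())
    return sorted(name.replace(separator, whitespace) for name in raw_names)


example = {"C001": {"kidney~failure", "renal~failure"}}
known = names_of(example, "C001")
unknown = names_of(example, "C999")
```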

get_preferred_name(cui)

Get the preferred name of a concept.

If no preferred name is found, the random ‘first’ name is selected.

Parameters:

cui (str) – The concept ID.

Returns:

str – The preferred name.

Return type:

str

get_first_name(cui)

Get the preprocessed, potentially arbitrary, first name of the given concept.

If the concept does not exist, the CUI itself is returned.

PS: The “first” name may not be consistent across runs since it relies on set order.

Parameters:

cui (str) – The concept ID.

Returns:

str – The first name.

Return type:

str

get_direct_children(cui)

Get the direct children of a concept.

This means only the children, but not grandchildren.

If the underlying CDB doesn’t list children for this CUI, an empty list is returned.

Parameters:

cui (str) – The concept in question.

Returns:

list[str] – The (potentially empty) list of direct children.

Return type:

list[str]

get_direct_parents(cui)

Get the direct parent(s) of a concept.

PS: This method can be quite CPU heavy, since it relies on running through all the parent-children relationships; the child->parent(s) relationship isn’t normally kept track of.

Parameters:

cui (str) – The concept in question.

Returns:

list[str] – The (potentially empty) list of direct parents.

Return type:

list[str]
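
Why the lookup is expensive can be seen in a small sketch: with only a parent-to-children map available, every entry has to be scanned to find the parents of one concept. The function name direct_parents and the toy hierarchy are assumptions for illustration, not the library's code.

```python
def direct_parents(cui2children: dict[str, set[str]], cui: str) -> list[str]:
    """Find every parent whose child set contains the given CUI."""
    # A full scan of the map: cost grows with the number of parent entries.
    return sorted(
        parent for parent, children in cui2children.items() if cui in children
    )


hierarchy = {"A": {"B", "C"}, "B": {"D"}, "X": {"D"}}
parents_of_d = direct_parents(hierarchy, "D")
parents_of_a = direct_parents(hierarchy, "A")
```

If parents are needed repeatedly, building an inverted child-to-parents map once would avoid rescanning.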

get_children_of(found_cuis, cui, depth=1)

Get the children of the specified CUI in the listed CUIs (if they exist).

Parameters:
  • found_cuis (Iterable[str]) – The list of CUIs to look in

  • cui (str) – The target parent CUI

  • depth (int) – The depth to carry out the search for

Returns:

list[str] – The list of children found

Return type:

list[str]
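
A depth-limited search of this kind might look like the following sketch. It assumes that depth=1 means direct children only and depth=2 adds grandchildren; the function name children_within and the sorted return order are illustrative choices rather than documented behaviour.

```python
from typing import Iterable


def children_within(
    cui2children: dict[str, set[str]],
    found_cuis: Iterable[str],
    cui: str,
    depth: int = 1,
) -> list[str]:
    """Return the descendants of cui (up to depth levels) present in found_cuis."""
    found = set(found_cuis)
    frontier = {cui}
    seen: set[str] = set()
    matches: list[str] = []
    for _ in range(depth):
        next_frontier: set[str] = set()
        for parent in frontier:
            for child in cui2children.get(parent, ()):
                if child in seen:
                    continue  # already visited via another parent
                seen.add(child)
                next_frontier.add(child)
                if child in found:
                    matches.append(child)
        frontier = next_frontier
    return sorted(matches)


tree = {"A": {"B", "C"}, "B": {"D"}}
depth_one = children_within(tree, ["D", "C", "X"], "A", depth=1)
depth_two = children_within(tree, ["D", "C", "X"], "A", depth=2)
```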

classmethod from_CDB(cdb)

Construct a TranslationLayer object from a concept database (CDB).

This translation layer will refer to the same dicts that the CDB refers to. While there is no obvious reason these should be modified, it’s something to keep in mind.

Parameters:

cdb (CDB) – The CDB

Returns:

TranslationLayer – The subsequent TranslationLayer

Return type:

TranslationLayer

class medcat.utils.regression.regression_checker.Strictness

Bases: enum.Enum

The total strictness on which to judge the results.

STRICTEST

The strictest option which only allows identical findings.

STRICT

A strict option which allows identical or children.

NORMAL

Normal strictness also allows partial overlaps on target concept and children.

LENIENT

Lenient strictness also allows parents and grandparents.

ANYTHING

Anything strictness allows ANY finding.

This would generally only be relevant when disabling examples for results descriptors.

__new__(value)
_generate_next_value_(start, count, last_values)

Generate the next value when not given.

name: the name of the member
start: the initial start value or None
count: the number of existing members
last_value: the last value assigned or None

classmethod _missing_(value)
__repr__()
__str__()
__dir__()

Returns all members and all public methods

__format__(format_spec)

Returns format using actual value type unless __str__ has been overridden.

__hash__()
__reduce_ex__(proto)
name()

The name of the Enum member.

value()

The value of the Enum member.
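
One way to read the strictness levels is as cumulative sets of acceptable findings, which is also the shape of STRICTNESS_MATRIX (dict[Strictness, set[Finding]]). The sketch below is built only from the member descriptions on this page, using a stand-in enum and finding names as strings; the actual contents of STRICTNESS_MATRIX may differ.

```python
from enum import Enum, auto


class Level(Enum):
    """Hypothetical stand-in for Strictness."""

    STRICTEST = auto()
    STRICT = auto()
    NORMAL = auto()
    LENIENT = auto()
    ANYTHING = auto()


IDENTICAL = {"IDENTICAL"}
CHILDREN = {"FOUND_ANY_CHILD"}
PARTIAL = {
    "BIGGER_SPAN_RIGHT", "BIGGER_SPAN_LEFT", "BIGGER_SPAN_BOTH",
    "SMALLER_SPAN", "PARTIAL_OVERLAP", "FOUND_CHILD_PARTIAL",
}
PARENTS = {"FOUND_DIR_PARENT", "FOUND_DIR_GRANDPARENT"}

# Each level accepts everything the stricter level accepts, plus more.
ALLOWED = {
    Level.STRICTEST: IDENTICAL,
    Level.STRICT: IDENTICAL | CHILDREN,
    Level.NORMAL: IDENTICAL | CHILDREN | PARTIAL,
    Level.LENIENT: IDENTICAL | CHILDREN | PARTIAL | PARENTS,
}


def passes(level: Level, finding_name: str) -> bool:
    """Check whether a finding counts as correct at the given level."""
    if level is Level.ANYTHING:
        return True  # ANYTHING accepts any finding
    return finding_name in ALLOWED[level]
```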

class medcat.utils.regression.regression_checker.Finding

Bases: enum.Enum

Describes whether or how the finding was verified.

The idea is that we know where we expect the entity to be recognised and the enum constants describe how the recognition compared to the expectation.

In essence, we want to know the relative positions of the two pairs of numbers (character numbers):
  • Expected Start, Expected End
  • Recognised Start, Recognised End

We can model this as 4 numbers on the number line, and we want to know their position relative to each other. For example, if the expected positions are marked with * and recognised positions with #, we may have something like:

___*__#_______#*______________

which would indicate that there is a partial, but smaller span recognised.

IDENTICAL

The CUI and the span recognised are identical to what was expected.

BIGGER_SPAN_RIGHT

The CUI is the same, but the recognised span is longer on the right.

If we use the notation from the class doc string, e.g: _*#__*__#

BIGGER_SPAN_LEFT

The CUI is the same, but the recognised span is longer on the left.

If we use the notation from the class doc string, e.g: _#_*__*#_

BIGGER_SPAN_BOTH

The CUI is the same, but the recognised span is longer on both sides.

If we use the notation from the class doc string, e.g: _#__*__*__#_

SMALLER_SPAN

The CUI is the same, but the recognised span is smaller.

If we use the notation from the class doc string, e.g:
  • _*_#_#_*_ (neither start nor end match)
  • _*#_#_*__ (start matches, but end is before expected)
  • _*__#_#*_ (end matches, but start is after expected)

PARTIAL_OVERLAP

The CUI is the same, but the span overlaps partially.

If we use the notation from the class doc string, e.g:
  • _*_#__*_#_ (starts between expected start and end, but ends beyond)
  • _#_*_#_*__ (start before expected start, but ends between expected start and end)

FOUND_DIR_PARENT

The recognised CUI is a parent of the expected CUI but the span is an exact match.

FOUND_DIR_GRANDPARENT

The recognised CUI is a grandparent of the expected CUI but the span is an exact match.

FOUND_ANY_CHILD

The recognised CUI is a child of the expected CUI but the span is an exact match.

FOUND_CHILD_PARTIAL

The recognised CUI is a child yet the match is only partial (smaller/bigger/partial).

FOUND_OTHER

Found another CUI in the same span.

FAIL

The concept was not recognised in any meaningful way.
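
The span-related members can be told apart by comparing the four character offsets. The sketch below is one way to do so, assuming the CUI already matches and that end offsets are exclusive; it returns member names as strings and is not the logic of Finding.determine, which also considers parents, children and other CUIs.

```python
def classify_span(exp_start: int, exp_end: int, rec_start: int, rec_end: int) -> str:
    """Name the span relationship between expected and recognised offsets."""
    if (rec_start, rec_end) == (exp_start, exp_end):
        return "IDENTICAL"
    if rec_end <= exp_start or rec_start >= exp_end:
        return "FAIL"  # the two spans do not overlap at all
    if rec_start >= exp_start and rec_end <= exp_end:
        return "SMALLER_SPAN"  # recognised span lies inside the expected one
    if rec_start <= exp_start and rec_end >= exp_end:
        # Recognised span covers the expected one; see which side(s) extend.
        if rec_start == exp_start:
            return "BIGGER_SPAN_RIGHT"
        if rec_end == exp_end:
            return "BIGGER_SPAN_LEFT"
        return "BIGGER_SPAN_BOTH"
    return "PARTIAL_OVERLAP"


results = [
    classify_span(10, 20, 10, 25),
    classify_span(10, 20, 12, 18),
    classify_span(10, 20, 15, 25),
]
print(results)
```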

has_correct_cui()

Whether the finding found the correct concept.

Returns:

bool – Whether the correct concept was found.

Return type:

bool

classmethod determine(exp_cui, exp_start, exp_end, tl, found_entities, strict_only=False, check_children=True, check_parent=True, check_grandparent=True)

Determine the finding type based on the input

Parameters:
  • exp_cui (str) – Expected CUI.

  • exp_start (int) – Expected span start.

  • exp_end (int) – Expected span end.

  • tl (TranslationLayer) – The translation layer.

  • found_entities (dict[int, Entity]) – The entities found by the model.

  • strict_only (bool) – Whether to use a strict-only mode (either identical or fail). Defaults to False.

  • check_children (bool) – Whether to check the children. Defaults to True.

  • check_parent (bool) – Whether to check for parent(s). Defaults to True.

  • check_grandparent (bool) – Whether to check for grandparent(s). Defaults to True.

Returns:

tuple[‘Finding’, Optional[str]] – The type of finding determined, and the alternative.

Return type:

tuple[Finding, Optional[str]]

__new__(value)
_generate_next_value_(start, count, last_values)

Generate the next value when not given.

name: the name of the member
start: the initial start value or None
count: the number of existing members
last_value: the last value assigned or None

classmethod _missing_(value)
__repr__()
__str__()
__dir__()

Returns all members and all public methods

__format__(format_spec)

Returns format using actual value type unless __str__ has been overridden.

__hash__()
__reduce_ex__(proto)
name()

The name of the Enum member.

value()

The value of the Enum member.

medcat.utils.regression.regression_checker.STRICTNESS_MATRIX: dict[Strictness, set[Finding]]
medcat.utils.regression.regression_checker.logger
medcat.utils.regression.regression_checker.DEFAULT_TEST_SUITE_PATH
medcat.utils.regression.regression_checker.show_description()
medcat.utils.regression.regression_checker.main(model_pack_dir, test_suite_file, phrases=False, hide_empty=False, examples_strictness_str='STRICTEST', jsonpath=None, overwrite=False, jsonindent=None, strictness_str='NORMAL', max_phrase_length=80, use_mct_export=False, mct_export_yaml_path=None, only_mct_export_conversion=False, only_describe=False, require_fully_correct=False, edit_distance=(0, 0, 0))

Check test suite against the specified model pack.

Parameters:
  • model_pack_dir (Path) – The path to the model pack

  • test_suite_file (Path) – The path to the test suite YAML

  • phrases (bool) – Whether to show per-phrase information in a report

  • hide_empty (bool) – Whether to hide empty cases in a report

  • examples_strictness_str (str) – The example strictness string. Defaults to STRICTEST. NOTE: If you set this to ‘None’, examples will be omitted.

  • jsonpath (Optional[Path]) – The json path to save the report to (if specified)

  • overwrite (bool) – Whether to overwrite the file if it exists. Defaults to False

  • jsonindent (int) – The indentation for json objects. Defaults to 0

  • strictness_str (str) – The strictness name. Defaults to NORMAL.

  • max_phrase_length (int) – The maximum phrase length in examples. Defaults to 80.

  • use_mct_export (bool) – Whether to use a MedCATtrainer export as input. Defaults to False.

  • mct_export_yaml_path (str) – The (optional) path at which the converted MCT export should be saved as YAML. If not set (or None), the MCT export is not saved in YAML format. Defaults to None.

  • only_mct_export_conversion (bool) – Whether to only deal with the MCT export conversion, i.e. exit when the MCT export conversion is done. Defaults to False.

  • only_describe (bool) – Whether to only describe the finding options and exit. Defaults to False.

  • require_fully_correct (bool) – Whether all cases are required to be correct. If set to True, an exit-status of 1 is returned unless all (sub)cases are correct. Defaults to False.

  • edit_distance (tuple[int, int, int]) – The edit distance, the random seed, and the number of edited names to pick for each of the names. If set to non-0, the specified number of splits, deletes, transposes, replaces, or inserts are done to each name. This can be useful for looking at the capability of identifying typos in text. However, this can make the process a lot slower as a result. Defaults to (0, 0, 0).

Raises:

ValueError – If unable to overwrite file or folder does not exist.

Return type:

None

medcat.utils.regression.regression_checker.tuple3_parser(arg)
Parameters:

arg (str)

Return type:

tuple[int, int, int]
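
A parser with this signature is typically passed as the type of an argparse argument. The sketch below assumes a comma-separated input such as "1,42,5" (optionally wrapped in parentheses); the accepted format, the function name parse_int_triple and the --edit-distance flag are all assumptions and may not match the module's actual command line.

```python
import argparse


def parse_int_triple(arg: str) -> tuple[int, int, int]:
    """Parse a string like '1,42,5' or '(1, 42, 5)' into a 3-tuple of ints."""
    parts = [part.strip() for part in arg.strip().strip("()").split(",")]
    if len(parts) != 3:
        raise argparse.ArgumentTypeError(
            f"Expected 3 comma-separated integers, got {arg!r}"
        )
    try:
        first, second, third = (int(part) for part in parts)
    except ValueError as err:
        raise argparse.ArgumentTypeError(f"Non-integer value in {arg!r}") from err
    return first, second, third


# Hypothetical flag name, used only to show how such a parser plugs into argparse.
cli = argparse.ArgumentParser()
cli.add_argument("--edit-distance", type=parse_int_triple, default=(0, 0, 0))
args = cli.parse_args(["--edit-distance", "1,42,5"])
```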

medcat.utils.regression.regression_checker.parser