medcat.utils.regression.regression_checker
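Attributes
STRICTNESS_MATRIX |
logger |
DEFAULT_TEST_SUITE_PATH |
parser |
Classes
CAT | This is a collection of serialisable model parts.
RegressionSuite | The regression checker.
TranslationLayer | The translation layer for translating between CUIs, names, type_ids and child CUIs.
Strictness | The total strictness on which to judge the results.
Finding | Describes whether or how the finding was verified.
Functions
show_description |
main | Check test suite against the specified model pack.
tuple3_parser |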
Module Contents
- class medcat.utils.regression.regression_checker.CAT(cdb, vocab=None, config=None, model_load_path=None)
Bases:
medcat.storage.serialisables.AbstractSerialisable
This is a collection of serialisable model parts.
- Parameters:
cdb (medcat.cdb.CDB)
vocab (Union[medcat.vocab.Vocab, None])
config (Optional[medcat.config.config.Config])
model_load_path (Optional[str])
- __init__(cdb, vocab=None, config=None, model_load_path=None)
- Parameters:
cdb (medcat.cdb.CDB)
vocab (Union[medcat.vocab.Vocab, None])
config (Optional[medcat.config.config.Config])
model_load_path (Optional[str])
- Return type:
None
- cdb
- vocab = None
- config = None
- _trainer: medcat.trainer.Trainer | None = None
- _pipeline
- usage_monitor
- _recreate_pipe(model_load_path=None)
- Parameters:
model_load_path (Optional[str])
- Return type:
- classmethod get_init_attrs()
- Return type:
list[str]
- classmethod ignore_attrs()
- Return type:
list[str]
- __call__(text)
- Parameters:
text (str)
- Return type:
Optional[medcat.tokenizing.tokens.MutableDocument]
- _ensure_not_training()
Method to ensure config is not set to train.
config.components.linking.train should only be True while training and not during inference. This method also corrects the setting if necessary.
- Return type:
None
- get_entities(text: str, only_cui: Literal[False] = False) -> medcat.data.entities.Entities
- get_entities(text: str, only_cui: Literal[True] = True) -> medcat.data.entities.OnlyCUIEntities
- get_entities(text: str, only_cui: bool = False) -> dict | medcat.data.entities.Entities | medcat.data.entities.OnlyCUIEntities
Get the entities recognised and linked within the provided text.
This will run the text through the pipeline and annotate the recognised and linked entities.
- Parameters:
text (str) – The text to use.
only_cui (bool, optional) – Whether to only output the CUIs rather than the entire context. Defaults to False.
- Returns:
Union[dict, Entities, OnlyCUIEntities] – The entities found and linked within the text.
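The overloads above tie the return type to the only_cui flag. The sketch below illustrates that pattern only. It does not use MedCAT: Entity, Entities and OnlyCUIEntities are trimmed-down, hypothetical stand-ins, get_entities_sketch is a made-up name, and the "recognition" is a toy lookup with invented CUIs.

```python
from typing import Literal, TypedDict, Union, overload


class Entity(TypedDict):
    # Hypothetical, trimmed-down stand-in for a full entity record.
    cui: str
    start: int
    end: int
    source_value: str


class Entities(TypedDict):
    # Stand-in for the full output: entity ID -> entity details.
    entities: dict[int, Entity]


class OnlyCUIEntities(TypedDict):
    # Stand-in for the reduced output: entity ID -> CUI only.
    entities: dict[int, str]


@overload
def get_entities_sketch(text: str, only_cui: Literal[False] = False) -> Entities: ...
@overload
def get_entities_sketch(text: str, only_cui: Literal[True]) -> OnlyCUIEntities: ...


def get_entities_sketch(text: str, only_cui: bool = False) -> Union[Entities, OnlyCUIEntities]:
    # Toy "pipeline": a fixed name -> CUI table with invented CUIs.
    toy_names = {"fever": "C1", "cough": "C2"}
    found: dict[int, Entity] = {}
    for name, cui in toy_names.items():
        start = text.find(name)
        if start >= 0:
            found[len(found)] = Entity(
                cui=cui, start=start, end=start + len(name), source_value=name)
    if only_cui:
        # Keep only the CUI for each entity ID.
        return OnlyCUIEntities(entities={k: e["cui"] for k, e in found.items()})
    return Entities(entities=found)
```

With this shape, a type checker can infer the reduced output type whenever only_cui=True is passed literally, and the full type otherwise.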
- _mp_worker_func(texts_and_indices)
- Parameters:
texts_and_indices (list[tuple[str, str, bool]])
- Return type:
list[tuple[str, str, Union[dict, medcat.data.entities.Entities, medcat.data.entities.OnlyCUIEntities]]]
- _generate_batches_by_char_length(text_iter, batch_size_chars, only_cui)
- Parameters:
text_iter (Union[Iterator[str], Iterator[tuple[str, str]]])
batch_size_chars (int)
only_cui (bool)
- Return type:
Iterator[list[tuple[str, str, bool]]]
- _generate_batches(text_iter, batch_size, batch_size_chars, only_cui)
- Parameters:
text_iter (Union[Iterator[str], Iterator[tuple[str, str]]])
batch_size (int)
batch_size_chars (int)
only_cui (bool)
- Return type:
Iterator[list[tuple[str, str, bool]]]
- _generate_simple_batches(text_iter, batch_size, only_cui)
- Parameters:
text_iter (Union[Iterator[str], Iterator[tuple[str, str]]])
batch_size (int)
only_cui (bool)
- Return type:
Iterator[list[tuple[str, str, bool]]]
- _mp_one_batch_per_process(executor, batch_iter, external_processes)
- Parameters:
executor (concurrent.futures.ProcessPoolExecutor)
batch_iter (Iterator[list[tuple[str, str, bool]]])
external_processes (int)
- Return type:
Iterator[tuple[str, Union[dict, medcat.data.entities.Entities, medcat.data.entities.OnlyCUIEntities]]]
- get_entities_multi_texts(texts, only_cui=False, n_process=1, batch_size=-1, batch_size_chars=1000000)
Get entities from multiple texts (potentially in parallel).
If n_process > 1, n_process - 1 new processes will be created and data will be processed on those as well as the main process in parallel.
- Parameters:
texts (Union[Iterable[str], Iterable[tuple[str, str]]]) – The input text. Either an iterable of raw text or an iterable of (text_index, text) tuples.
only_cui (bool) – Whether to only return CUIs rather than other information like start/end and annotated value. Defaults to False.
n_process (int) – Number of processes to use. Defaults to 1.
batch_size (int) – The number of texts to batch at a time. A batch of the specified size will be given to each worker process. Defaults to -1 and in this case the character count will be used instead.
batch_size_chars (int) – The maximum number of characters to process in a batch. Each process will be given a batch of texts with a total number of characters not exceeding this value. Defaults to 1,000,000 characters. Set to -1 to disable.
- Yields:
Iterator[tuple[str, Union[dict, Entities, OnlyCUIEntities]]] – The results in the format of (text_index, entities).
- Return type:
Iterator[tuple[str, Union[dict, medcat.data.entities.Entities, medcat.data.entities.OnlyCUIEntities]]]
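The character-based batching described for batch_size_chars can be pictured with the sketch below. It is not the library's implementation: batches_by_char_length is a hypothetical name, and two behaviours are assumptions made for the sketch (raw texts get their position as a string index, and a single text longer than the limit is placed in a batch of its own).

```python
from typing import Iterable, Iterator, Union


def batches_by_char_length(
    texts: Iterable[Union[str, tuple[str, str]]],
    batch_size_chars: int,
) -> Iterator[list[tuple[str, str]]]:
    """Group texts so each batch holds at most batch_size_chars characters."""
    batch: list[tuple[str, str]] = []
    chars = 0
    for position, item in enumerate(texts):
        if isinstance(item, tuple):
            text_index, text = item
        else:
            # Assumption: raw texts are indexed by their position.
            text_index, text = str(position), item
        # Close the current batch if adding this text would exceed the limit.
        if batch and chars + len(text) > batch_size_chars:
            yield batch
            batch, chars = [], 0
        batch.append((text_index, text))
        chars += len(text)
    if batch:
        yield batch
```

For example, the texts "aaaa", "bb", "ccc", "d" with a limit of 6 characters fall into two batches: the first two texts (6 characters) and the last two (4 characters).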
- _get_entity(ent, doc_tokens, cui)
- Parameters:
doc_tokens (list[str])
cui (str)
- Return type:
- get_addon_output(ent)
Get the addon output for the entity.
This includes a key-value pair for each addon that provides some. Sometimes same-type addons may combine their output under the same key.
- Parameters:
ent (MutableEntity) – The entity in question.
- Raises:
ValueError – If unable to merge multiple addon output.
- Returns:
dict[str, dict] – All the addon output.
- Return type:
dict[str, dict]
- _doc_to_out_entity(ent, doc_tokens, only_cui)
- Parameters:
doc_tokens (list[str])
only_cui (bool)
- Return type:
tuple[int, Union[medcat.data.entities.Entity, str]]
- _doc_to_out(doc, only_cui, out_with_text=False)
- Parameters:
only_cui (bool)
out_with_text (bool)
- Return type:
Union[medcat.data.entities.Entities, medcat.data.entities.OnlyCUIEntities]
- property trainer
The trainer object.
- save_model_pack(target_folder, pack_name=DEFAULT_PACK_NAME, serialiser_type='dill', make_archive=True, only_archive=False, add_hash_to_pack_name=True, change_description=None)
Save model pack.
The resulting model pack name will have the hash of the model pack in its name if (and only if) the default model pack name is used.
- Parameters:
target_folder (str) – The folder to save the pack in.
pack_name (str, optional) – The model pack name. Defaults to DEFAULT_PACK_NAME.
serialiser_type (Union[str, AvailableSerialisers], optional) – The serialiser type. Defaults to ‘dill’.
make_archive (bool) – Whether to make the archive (.zip file). Defaults to True.
only_archive (bool) – Whether to clear the non-compressed folder. Defaults to False.
add_hash_to_pack_name (bool) – Whether to add the hash to the pack name. This is only relevant if pack_name is specified. Defaults to True.
change_description (Optional[str]) – If provided, this description will be added to the model description. Defaults to None.
- Returns:
str – The final model pack path.
- Return type:
str
- _get_hash()
- Return type:
str
- _versioning(change_description)
- Parameters:
change_description (Optional[str])
- Return type:
str
- classmethod attempt_unpack(zip_path)
Attempt to unpack the ZIP to a folder and get the model pack path.
If the folder already exists, no unpacking is done.
- Parameters:
zip_path (str) – The ZIP path
- Returns:
str – The model pack path
- Return type:
str
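The "unpack only if the folder does not exist yet" behaviour can be sketched with the standard library as follows. This is an illustration rather than the method's actual code: attempt_unpack_sketch and demo are hypothetical names, and deriving the folder name by dropping the .zip suffix is an assumption.

```python
import os
import tempfile
import zipfile


def attempt_unpack_sketch(zip_path: str) -> str:
    """Unpack zip_path next to itself unless the target folder already exists."""
    base, ext = os.path.splitext(zip_path)
    # Assumption: the folder is the ZIP path without its ".zip" suffix.
    folder = base if ext == ".zip" else zip_path + ".unpacked"
    if not os.path.isdir(folder):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
    return folder


def demo() -> tuple[bool, bool]:
    """Create a tiny ZIP, unpack it, then call again with the ZIP removed."""
    with tempfile.TemporaryDirectory() as tmp:
        zip_path = os.path.join(tmp, "pack.zip")
        with zipfile.ZipFile(zip_path, "w") as archive:
            archive.writestr("config.json", "{}")
        first = attempt_unpack_sketch(zip_path)
        unpacked = os.path.isfile(os.path.join(first, "config.json"))
        # With the folder in place, the second call never opens the ZIP,
        # so it succeeds even though the archive has been deleted.
        os.remove(zip_path)
        second = attempt_unpack_sketch(zip_path)
        return unpacked, first == second
```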
- classmethod load_model_pack(model_pack_path)
Load the model pack from file.
- Parameters:
model_pack_path (str) – The model pack path.
- Raises:
ValueError – If the saved data does not represent a model pack.
- Returns:
CAT – The loaded model pack.
- Return type:
CAT
- classmethod load_cdb(model_pack_path)
Loads the concept database from the provided model pack path
- Parameters:
model_pack_path (str) – path to model pack, zip or dir.
- Returns:
CDB – The loaded concept database
- Return type:
CDB
- get_model_card(as_dict: Literal[True]) -> medcat.data.model_card.ModelCard
- get_model_card(as_dict: Literal[False]) -> str
Get the model card as either a (nested) dict or a JSON string.
- Parameters:
as_dict (bool) – Whether to return as dict. Defaults to False.
- Returns:
Union[str, ModelCard] – The model card.
- __eq__(other)
- Parameters:
other (Any)
- Return type:
bool
- add_addon(addon)
- Parameters:
- Return type:
None
- get_strategy()
- Return type:
- classmethod include_properties()
- Return type:
list[str]
- class medcat.utils.regression.regression_checker.RegressionSuite(cases, metadata, name)
The regression checker. This is used to check a bunch of regression cases at once against a model.
- Parameters:
cases (list[RegressionCase]) – The list of regression cases
metadata (MetaData) – The metadata for the regression suite
use_report (bool) – Whether or not to use the report functionality. Defaults to False.
name (str)
- __init__(cases, metadata, name)
- Parameters:
cases (list[RegressionCase])
metadata (MetaData)
name (str)
- Return type:
None
- cases: list[RegressionCase]
- report
- metadata
- get_all_distinct_cases(translation, edit_distance, use_diacritics)
Gets all the distinct cases for this regression suite.
While distinct cases can be determined without the translation layer, including it here simplifies the process.
- Parameters:
translation (TranslationLayer) – The translation layer.
edit_distance (tuple[int, int, int]) – The edit distance(s) to try. Defaults to (0, 0, 0).
use_diacritics (bool) – Whether to use diacritics for edit distance.
- Yields:
Iterator[tuple[RegressionCase, Iterator[FinalTarget]]] – The generator of the regression case along with its corresponding sub-cases.
- Return type:
Iterator[tuple[RegressionCase, Iterator[medcat.utils.regression.targeting.FinalTarget]]]
- estimate_total_distinct_cases()
- Return type:
int
- iter_subcases(translation, show_progress=True, edit_distance=(0, 0, 0), use_diacritics=False)
Iterate over all the sub-cases.
Each sub-case presents a unique target (phrase, concept, name) on the corresponding regression case.
- Parameters:
translation (TranslationLayer) – The translation layer.
show_progress (bool) – Whether to show progress. Defaults to True.
edit_distance (tuple[int, int, int]) – The edit distance(s) to try. Defaults to (0, 0, 0).
use_diacritics (bool) – Whether to use diacritics for edit distance.
- Yields:
Iterator[tuple[RegressionCase, FinalTarget]] – The generator of the regression case along with each of the final target sub-cases.
- Return type:
Iterator[tuple[RegressionCase, medcat.utils.regression.targeting.FinalTarget]]
- check_model(cat, translation, edit_distance=(0, 0, 0), use_diacritics=False)
Checks model and generates a report
- Parameters:
cat (CAT) – The model to check against
translation (TranslationLayer) – The translation layer
edit_distance (tuple[int, int, int]) – The edit distance of the names. Defaults to (0, 0, 0).
use_diacritics (bool) – Whether to use diacritics for edit distance.
- Returns:
MultiDescriptor – A report description
- Return type:
MultiDescriptor
- __str__()
- Return type:
str
- __repr__()
- Return type:
str
- to_dict()
Converts the RegressionSuite to a dict for serialisation.
- Returns:
dict – The dict representation
- Return type:
dict
- to_yaml()
Converts the RegressionSuite to a YAML string.
- Returns:
str – The YAML representation
- Return type:
str
- __eq__(other)
- Parameters:
other (object)
- Return type:
bool
- classmethod from_dict(in_dict, name)
Construct a RegressionSuite from a dict.
Most of the parsing is handled in RegressionCase.from_dict. This just assumes that each key in the dict is a name and each value describes a RegressionCase.
- Parameters:
in_dict (dict) – The input dict.
name (str) – The name of the regression suite.
- Returns:
RegressionSuite – The built regression suite
- Return type:
RegressionSuite
- classmethod from_yaml(file_name)
Constructs a RegressionSuite from a YAML file.
The from_dict method is used for the construction from the dict.
- Parameters:
file_name (str) – The file name
- Returns:
RegressionSuite – The constructed regression suite
- Return type:
RegressionSuite
- classmethod from_mct_export(file_name)
- Parameters:
file_name (str)
- Return type:
RegressionSuite
- class medcat.utils.regression.regression_checker.TranslationLayer(cui2info, name2info, cui2children, separator, whitespace=' ')
The translation layer for translating:
- CUIs to names
- names to CUIs
- type_ids to CUIs
- CUIs to child CUIs
The idea is to decouple these translations from the CDB instance in case something changes there.
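- Parameters:
cui2info (dict[str, medcat.cdb.concepts.CUIInfo])
name2info (dict[str, medcat.cdb.concepts.NameInfo])
cui2children (dict[str, set[str]])
separator (str)
whitespace (str)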
- __init__(cui2info, name2info, cui2children, separator, whitespace=' ')
- Parameters:
cui2info (dict[str, medcat.cdb.concepts.CUIInfo])
name2info (dict[str, medcat.cdb.concepts.NameInfo])
cui2children (dict[str, set[str]])
separator (str)
whitespace (str)
- Return type:
None
- cui2info
- name2info
- separator
- whitespace = ' '
- type_id2cuis: dict[str, set[str]]
- cui2children
- get_names_of(cui, only_prefnames)
Get the preprocessed names of a CUI.
This method preprocesses the names by replacing the separator (generally ~) with the appropriate whitespace (` `).
If the concept is not in the underlying CDB, an empty list is returned.
- Parameters:
cui (str) – The concept in question.
only_prefnames (bool) – Whether to only return a preferred name.
- Returns:
list[str] – The list of names.
- Return type:
list[str]
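The separator replacement described above amounts to a simple string substitution. The sketch below shows it on a plain dict instead of the real cui2info structure. get_names_of_sketch and the cui2names mapping are hypothetical, the CUI is invented, and sorting the result is an addition made here only to keep the output deterministic.

```python
def get_names_of_sketch(
    cui2names: dict[str, set[str]],
    cui: str,
    separator: str = "~",
    whitespace: str = " ",
) -> list[str]:
    """Return the names of a CUI with the separator replaced by whitespace."""
    # Unknown concepts yield an empty list rather than an error.
    names = cui2names.get(cui, set())
    return sorted(name.replace(separator, whitespace) for name in names)


# Toy data: names are stored with "~" between tokens.
toy_names = {"C1": {"heart~attack", "mi"}}
```

Here get_names_of_sketch(toy_names, "C1") gives ["heart attack", "mi"], and an unknown CUI gives [].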
- get_preferred_name(cui)
Get the preferred name of a concept.
If no preferred name is found, an arbitrary ‘first’ name is selected.
- Parameters:
cui (str) – The concept ID.
- Returns:
str – The preferred name.
- Return type:
str
- get_first_name(cui)
Get the preprocessed, (potentially) arbitrary, first name of the given concept.
If the concept does not exist, the CUI itself is returned.
PS: The “first” name may not be consistent across runs since it relies on set order.
- Parameters:
cui (str) – The concept ID.
- Returns:
str – The first name.
- Return type:
str
- get_direct_children(cui)
Get the direct children of a concept.
This means only the children, but not grandchildren.
If the underlying CDB doesn’t list children for this CUI, an empty list is returned.
- Parameters:
cui (str) – The concept in question.
- Returns:
list[str] – The (potentially empty) list of direct children.
- Return type:
list[str]
- get_direct_parents(cui)
Get the direct parent(s) of a concept.
PS: This method can be quite CPU heavy since it relies on running through all the parent-children relationships, as the child->parent(s) relationship isn’t normally kept track of.
- Parameters:
cui (str) – The concept in question.
- Returns:
list[str] – The (potentially empty) list of direct parents.
- Return type:
list[str]
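Why parent lookup is the expensive direction can be seen in the sketch below, which works on a plain parent-to-children mapping like cui2children. The function names and the toy hierarchy are hypothetical. invert_hierarchy is not something the documentation mentions. It is shown only as one possible way to avoid repeated full scans.

```python
def direct_children(cui2children: dict[str, set[str]], cui: str) -> list[str]:
    """Children are a single dict lookup."""
    return sorted(cui2children.get(cui, set()))


def direct_parents(cui2children: dict[str, set[str]], cui: str) -> list[str]:
    """Parents require scanning every parent -> children entry."""
    return sorted(parent for parent, children in cui2children.items()
                  if cui in children)


def invert_hierarchy(cui2children: dict[str, set[str]]) -> dict[str, set[str]]:
    """Build a child -> parents index once, for repeated parent queries."""
    child2parents: dict[str, set[str]] = {}
    for parent, children in cui2children.items():
        for child in children:
            child2parents.setdefault(child, set()).add(parent)
    return child2parents


# Toy hierarchy with invented CUIs: "C" has two parents.
toy_hierarchy = {"A": {"B", "C"}, "D": {"C"}}
```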
- get_children_of(found_cuis, cui, depth=1)
Get the children of the specified CUI in the listed CUIs (if they exist).
- Parameters:
found_cuis (Iterable[str]) – The list of CUIs to look in
cui (str) – The target parent CUI
depth (int) – The depth to carry out the search for
- Returns:
list[str] – The list of children found
- Return type:
list[str]
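One way to read the depth parameter is as a level-limited walk down the hierarchy that keeps only descendants present in found_cuis. The sketch below follows that reading. It is an assumption about the semantics rather than the method's actual code, and children_within plus the toy tree are hypothetical.

```python
from typing import Iterable


def children_within(
    cui2children: dict[str, set[str]],
    found_cuis: Iterable[str],
    cui: str,
    depth: int = 1,
) -> list[str]:
    """Return descendants of cui (up to depth levels) that are in found_cuis."""
    found = set(found_cuis)
    matches: list[str] = []
    seen = {cui}
    frontier = [cui]
    for _ in range(depth):
        next_frontier: list[str] = []
        for parent in frontier:
            # Sorted only to make the output order deterministic.
            for child in sorted(cui2children.get(parent, ())):
                if child in seen:
                    continue
                seen.add(child)
                next_frontier.append(child)
                if child in found:
                    matches.append(child)
        frontier = next_frontier
    return matches


# Toy tree with invented CUIs: A -> {B, C}, B -> {D}.
toy_tree = {"A": {"B", "C"}, "B": {"D"}}
```

With found_cuis ["D", "C", "X"], a depth of 1 finds only "C", while a depth of 2 also reaches the grandchild "D".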
- classmethod from_CDB(cdb)
Construct a TranslationLayer object from a concept database (CDB).
This translation layer will refer to the same dicts that the CDB refers to. While there is no obvious reason these should be modified, it’s something to keep in mind.
- Parameters:
cdb (CDB) – The CDB
- Returns:
TranslationLayer – The subsequent TranslationLayer
- Return type:
TranslationLayer
- class medcat.utils.regression.regression_checker.Strictness
Bases:
enum.Enum
The total strictness on which to judge the results.
- STRICTEST
The strictest option which only allows identical findings.
- STRICT
A strict option which allows identical or children.
- NORMAL
Normal strictness also allows partial overlaps on target concept and children.
- LENIENT
Lenient strictness also allows parents and grandparents.
- ANYTHING
Anything strictness allows ANY finding.
This would generally only be relevant when disabling examples for results descriptors.
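The levels above are cumulative, which suggests a mapping from each strictness to the set of findings it accepts (the module exposes STRICTNESS_MATRIX with that type). The sketch below builds such a mapping from the member descriptions on this page. The real contents of STRICTNESS_MATRIX are not shown here, so the exact sets are an assumption, and the two enums are local stand-ins rather than the MedCAT classes.

```python
from enum import Enum, auto


class Strictness(Enum):
    # Local stand-in mirroring the documented member names.
    STRICTEST = auto()
    STRICT = auto()
    NORMAL = auto()
    LENIENT = auto()
    ANYTHING = auto()


class Finding(Enum):
    # Local stand-in mirroring the documented member names.
    IDENTICAL = auto()
    BIGGER_SPAN_RIGHT = auto()
    BIGGER_SPAN_LEFT = auto()
    BIGGER_SPAN_BOTH = auto()
    SMALLER_SPAN = auto()
    PARTIAL_OVERLAP = auto()
    FOUND_DIR_PARENT = auto()
    FOUND_DIR_GRANDPARENT = auto()
    FOUND_ANY_CHILD = auto()
    FOUND_CHILD_PARTIAL = auto()
    FOUND_OTHER = auto()
    FAIL = auto()


# Each level extends the previous one (assumed from the descriptions).
_strictest = {Finding.IDENTICAL}
_strict = _strictest | {Finding.FOUND_ANY_CHILD}
_normal = _strict | {
    Finding.BIGGER_SPAN_RIGHT, Finding.BIGGER_SPAN_LEFT,
    Finding.BIGGER_SPAN_BOTH, Finding.SMALLER_SPAN,
    Finding.PARTIAL_OVERLAP, Finding.FOUND_CHILD_PARTIAL,
}
_lenient = _normal | {Finding.FOUND_DIR_PARENT, Finding.FOUND_DIR_GRANDPARENT}

MATRIX_SKETCH: dict[Strictness, set[Finding]] = {
    Strictness.STRICTEST: _strictest,
    Strictness.STRICT: _strict,
    Strictness.NORMAL: _normal,
    Strictness.LENIENT: _lenient,
    Strictness.ANYTHING: set(Finding),
}


def passes(finding: Finding, strictness: Strictness) -> bool:
    """Whether a finding counts as correct at the given strictness."""
    return finding in MATRIX_SKETCH[strictness]
```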
- __new__(value)
- _generate_next_value_(start, count, last_values)
Generate the next value when not given.
name: the name of the member
start: the initial start value or None
count: the number of existing members
last_value: the last value assigned or None
- classmethod _missing_(value)
- __repr__()
- __str__()
- __dir__()
Returns all members and all public methods
- __format__(format_spec)
Returns format using actual value type unless __str__ has been overridden.
- __hash__()
- __reduce_ex__(proto)
- name()
The name of the Enum member.
- value()
The value of the Enum member.
- class medcat.utils.regression.regression_checker.Finding
Bases:
enum.Enum
Describes whether or how the finding was verified.
The idea is that we know where we expect the entity to be recognised and the enum constants describe how the recognition compared to the expectation.
In essence, we want to know the relative positions of the two pairs of numbers (character positions):
- Expected Start, Expected End
- Recognised Start, Recognised End
We can model this as 4 numbers on the number line, and we want to know their positions relative to each other. For example, if the expected positions are marked with * and the recognised positions with #, we may have something like:
___*__#_______#*______________
This would indicate that a partial, but smaller, span was recognised.
- IDENTICAL
The CUI and the span recognised are identical to what was expected.
- BIGGER_SPAN_RIGHT
The CUI is the same, but the recognised span is longer on the right.
Using the notation from the class docstring, e.g.:
_*#__*__#
- BIGGER_SPAN_LEFT
The CUI is the same, but the recognised span is longer on the left.
Using the notation from the class docstring, e.g.:
_#_*__*#_
- BIGGER_SPAN_BOTH
The CUI is the same, but the recognised span is longer on both sides.
Using the notation from the class docstring, e.g.:
_#__*__*__#_
- SMALLER_SPAN
The CUI is the same, but the recognised span is smaller.
Using the notation from the class docstring, e.g.:
_*_#_#_*_ (neither start nor end match)
_*#_#_*__ (start matches, but end is before expected)
_*__#_#*_ (end matches, but start is after expected)
- PARTIAL_OVERLAP
The CUI is the same, but the span overlaps partially.
Using the notation from the class docstring, e.g.:
_*_#__*_#_ (starts between expected start and end, but ends beyond)
_#_*_#_*__ (starts before expected start, but ends between expected start and end)
- FOUND_DIR_PARENT
The recognised CUI is a parent of the expected CUI but the span is an exact match.
- FOUND_DIR_GRANDPARENT
The recognised CUI is a grandparent of the expected CUI but the span is an exact match.
- FOUND_ANY_CHILD
The recognised CUI is a child of the expected CUI but the span is an exact match.
- FOUND_CHILD_PARTIAL
The recognised CUI is a child yet the match is only partial (smaller/bigger/partial).
- FOUND_OTHER
Found another CUI in the same span.
- FAIL
The concept was not recognised in any meaningful way.
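For the same-CUI cases, the span-related members above reduce to comparing two (start, end) pairs. The sketch below is one possible classification and not the logic of Finding.determine. classify_span is a hypothetical name, spans are assumed to be half-open character offsets, and treating non-overlapping spans as FAIL is an assumption.

```python
def classify_span(exp_start: int, exp_end: int,
                  rec_start: int, rec_end: int) -> str:
    """Name the relation of a recognised span to the expected one.

    Assumes the recognised CUI already equals the expected CUI.
    """
    if rec_end <= exp_start or rec_start >= exp_end:
        return "FAIL"  # assumption: no overlap at all counts as a failure
    if (rec_start, rec_end) == (exp_start, exp_end):
        return "IDENTICAL"
    if rec_start <= exp_start and rec_end >= exp_end:
        # The recognised span fully covers the expected span.
        if rec_start == exp_start:
            return "BIGGER_SPAN_RIGHT"
        if rec_end == exp_end:
            return "BIGGER_SPAN_LEFT"
        return "BIGGER_SPAN_BOTH"
    if rec_start >= exp_start and rec_end <= exp_end:
        # The recognised span lies fully inside the expected span.
        return "SMALLER_SPAN"
    # Otherwise the spans overlap while each sticks out on one side.
    return "PARTIAL_OVERLAP"
```

For an expected span of (5, 10), a recognised span of (5, 14) is classified as BIGGER_SPAN_RIGHT, (6, 9) as SMALLER_SPAN and (7, 12) as PARTIAL_OVERLAP.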
- has_correct_cui()
Whether the finding found the correct concept.
- Returns:
bool – Whether the correct concept was found.
- Return type:
bool
- classmethod determine(exp_cui, exp_start, exp_end, tl, found_entities, strict_only=False, check_children=True, check_parent=True, check_grandparent=True)
Determine the finding type based on the input
- Parameters:
exp_cui (str) – Expected CUI.
exp_start (int) – Expected span start.
exp_end (int) – Expected span end.
tl (TranslationLayer) – The translation layer.
found_entities (dict[int, Entity]) – The entities found by the model.
strict_only (bool) – Whether to use a strict-only mode (either identical or fail). Defaults to False.
check_children (bool) – Whether to check the children. Defaults to True.
check_parent (bool) – Whether to check for parent(s). Defaults to True.
check_grandparent (bool) – Whether to check for grandparent(s). Defaults to True.
- Returns:
tuple[‘Finding’, Optional[str]] – The type of finding determined, and the alternative.
- Return type:
tuple[Finding, Optional[str]]
- __new__(value)
- _generate_next_value_(start, count, last_values)
Generate the next value when not given.
name: the name of the member
start: the initial start value or None
count: the number of existing members
last_value: the last value assigned or None
- classmethod _missing_(value)
- __repr__()
- __str__()
- __dir__()
Returns all members and all public methods
- __format__(format_spec)
Returns format using actual value type unless __str__ has been overridden.
- __hash__()
- __reduce_ex__(proto)
- name()
The name of the Enum member.
- value()
The value of the Enum member.
- medcat.utils.regression.regression_checker.STRICTNESS_MATRIX: dict[Strictness, set[Finding]]
- medcat.utils.regression.regression_checker.logger
- medcat.utils.regression.regression_checker.DEFAULT_TEST_SUITE_PATH
- medcat.utils.regression.regression_checker.show_description()
- medcat.utils.regression.regression_checker.main(model_pack_dir, test_suite_file, phrases=False, hide_empty=False, examples_strictness_str='STRICTEST', jsonpath=None, overwrite=False, jsonindent=None, strictness_str='NORMAL', max_phrase_length=80, use_mct_export=False, mct_export_yaml_path=None, only_mct_export_conversion=False, only_describe=False, require_fully_correct=False, edit_distance=(0, 0, 0))
Check test suite against the specified model pack.
- Parameters:
model_pack_dir (Path) – The path to the model pack
test_suite_file (Path) – The path to the test suite YAML
phrases (bool) – Whether to show per-phrase information in a report
hide_empty (bool) – Whether to hide empty cases in a report
examples_strictness_str (str) – The example strictness string. Defaults to STRICTEST. NOTE: If you set this to ‘None’, examples will be omitted.
jsonpath (Optional[Path]) – The json path to save the report to (if specified)
overwrite (bool) – Whether to overwrite the file if it exists. Defaults to False
jsonindent (Optional[int]) – The indentation for json objects. Defaults to None.
strictness_str (str) – The strictness name. Defaults to NORMAL.
max_phrase_length (int) – The maximum phrase length in examples. Defaults to 80.
use_mct_export (bool) – Whether to use a MedCATtrainer export as input. Defaults to False.
mct_export_yaml_path (str) – The (optional) path the converted MCT export should be saved as YAML at. If not set (or None), the MCT export is not saved in YAML format. Defaults to None.
only_mct_export_conversion (bool) – Whether to only deal with the MCT export conversion, i.e. exit when the MCT export conversion is done. Defaults to False.
only_describe (bool) – Whether to only describe the finding options and exit. Defaults to False.
require_fully_correct (bool) – Whether all cases are required to be correct. If set to True, an exit-status of 1 is returned unless all (sub)cases are correct. Defaults to False.
edit_distance (tuple[int, int, int]) – The edit distance, the random seed, and the number of edited names to pick for each of the names. If set to non-0, the specified number of splits, deletes, transposes, replaces, or inserts are done to each name. This can be useful for looking at the capability of identifying typos in text. However, it can make the process a lot slower. Defaults to (0, 0, 0).
- Raises:
ValueError – If unable to overwrite file or folder does not exist.
- Return type:
None
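To give a feel for what the edit_distance triple (distance, random seed, number of picks) implies, the sketch below generates typo variants of a name with the classic single-edit construction and then samples a fixed number of them reproducibly. It is an illustration under assumptions, not the regression checker's code: edits1 and pick_edited_names are hypothetical names, splits are used here only as the intermediate step for the other edits, and returning the unedited name for distance 0 is an assumption.

```python
import random
import string


def edits1(word: str, alphabet: str = string.ascii_lowercase) -> set[str]:
    """All strings one delete, transpose, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + ch + right[1:]
                for left, right in splits if right for ch in alphabet]
    inserts = [left + ch + right for left, right in splits for ch in alphabet]
    return set(deletes + transposes + replaces + inserts) - {word}


def pick_edited_names(name: str, edit_distance: tuple[int, int, int]) -> list[str]:
    """Pick a seeded sample of variants at the given number of edits."""
    distance, seed, pick = edit_distance
    if distance == 0:
        return [name]  # assumption: no edits means the name is used as-is
    variants = {name}
    for _ in range(distance):
        variants = {edited for variant in variants for edited in edits1(variant)}
    variants.discard(name)
    # Sort before sampling so the same seed always gives the same picks.
    ordered = sorted(variants)
    return random.Random(seed).sample(ordered, min(pick, len(ordered)))
```

The number of variants grows quickly with the distance, which is consistent with the note above that non-zero edit distances can slow the check down considerably.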
- medcat.utils.regression.regression_checker.tuple3_parser(arg)
- Parameters:
arg (str)
- Return type:
tuple[int, int, int]
- medcat.utils.regression.regression_checker.parser