medcat.utils.regression.regression_checker ========================================== .. py:module:: medcat.utils.regression.regression_checker Attributes ---------- .. autoapisummary:: medcat.utils.regression.regression_checker.STRICTNESS_MATRIX medcat.utils.regression.regression_checker.logger medcat.utils.regression.regression_checker.DEFAULT_TEST_SUITE_PATH medcat.utils.regression.regression_checker.parser Classes ------- .. autoapisummary:: medcat.utils.regression.regression_checker.CAT medcat.utils.regression.regression_checker.RegressionSuite medcat.utils.regression.regression_checker.TranslationLayer medcat.utils.regression.regression_checker.Strictness medcat.utils.regression.regression_checker.Finding Functions --------- .. autoapisummary:: medcat.utils.regression.regression_checker.show_description medcat.utils.regression.regression_checker.main medcat.utils.regression.regression_checker.tuple3_parser Module Contents --------------- .. py:class:: CAT(cdb, vocab = None, config = None, model_load_path = None) Bases: :py:obj:`medcat.storage.serialisables.AbstractSerialisable` This is a collection of serialisable model parts. .. py:method:: __init__(cdb, vocab = None, config = None, model_load_path = None) .. py:attribute:: cdb .. py:attribute:: vocab :value: None .. py:attribute:: config :value: None .. py:attribute:: _trainer :type: Optional[medcat.trainer.Trainer] :value: None .. py:attribute:: _pipeline .. py:attribute:: usage_monitor .. py:method:: _recreate_pipe(model_load_path = None) .. py:method:: get_init_attrs() :classmethod: .. py:method:: ignore_attrs() :classmethod: .. py:method:: __call__(text) .. py:method:: _ensure_not_training() Method to ensure config is not set to train. `config.components.linking.train` should only be True while training and not during inference. This also corrects the setting if necessary. ..
py:method:: get_entities(text: str, only_cui: Literal[False] = False) -> medcat.data.entities.Entities get_entities(text: str, only_cui: Literal[True] = True) -> medcat.data.entities.OnlyCUIEntities get_entities(text: str, only_cui: bool = False) -> Union[dict, medcat.data.entities.Entities, medcat.data.entities.OnlyCUIEntities] Get the entities recognised and linked within the provided text. This will run the text through the pipeline and annotate the recognised and linked entities. :param text: The text to use. :type text: str :param only_cui: Whether to only output the CUIs rather than the entire context. Defaults to False. :type only_cui: bool, optional :Returns: **Union[dict, Entities, OnlyCUIEntities]** -- The entities found and linked within the text. .. py:method:: _mp_worker_func(texts_and_indices) .. py:method:: _generate_batches_by_char_length(text_iter, batch_size_chars, only_cui) .. py:method:: _generate_batches(text_iter, batch_size, batch_size_chars, only_cui) .. py:method:: _generate_simple_batches(text_iter, batch_size, only_cui) .. py:method:: _mp_one_batch_per_process(executor, batch_iter, external_processes) .. py:method:: get_entities_multi_texts(texts, only_cui = False, n_process = 1, batch_size = -1, batch_size_chars = 1000000) Get entities from multiple texts (potentially in parallel). If `n_process` > 1, `n_process - 1` new processes will be created and data will be processed on those as well as the main process in parallel. :param texts: The input texts. Either an iterable of raw text or one in the format of `(text_index, text)`. :type texts: Union[Iterable[str], Iterable[tuple[str, str]]] :param only_cui: Whether to only return CUIs rather than other information like start/end and annotated value. Defaults to False. :type only_cui: bool :param n_process: Number of processes to use. Defaults to 1. :type n_process: int :param batch_size: The number of texts to batch at a time.
A batch of the specified size will be given to each worker process. Defaults to -1 and in this case the character count will be used instead. :type batch_size: int :param batch_size_chars: The maximum number of characters to process in a batch. Each process will be given a batch of texts with a total number of characters not exceeding this value. Defaults to 1,000,000 characters. Set to -1 to disable. :type batch_size_chars: int :Yields: *Iterator[tuple[str, Union[dict, Entities, OnlyCUIEntities]]]* -- The results in the format of (text_index, entities). .. py:method:: _get_entity(ent, doc_tokens, cui) .. py:method:: get_addon_output(ent) Get the addon output for the entity. This includes a key-value pair for each addon that provides some. Sometimes same-type addons may combine their output under the same key. :param ent: The entity in question. :type ent: MutableEntity :raises ValueError: If unable to merge multiple addon output. :Returns: **dict[str, dict]** -- All the addon output. .. py:method:: _doc_to_out_entity(ent, doc_tokens, only_cui) .. py:method:: _doc_to_out(doc, only_cui, out_with_text = False) .. py:property:: trainer The trainer object. .. py:method:: save_model_pack(target_folder, pack_name = DEFAULT_PACK_NAME, serialiser_type = 'dill', make_archive = True, only_archive = False, add_hash_to_pack_name = True, change_description = None) Save model pack. The resulting model pack name will have the hash of the model pack in its name if (and only if) the default model pack name is used. :param target_folder: The folder to save the pack in. :type target_folder: str :param pack_name: The model pack name. Defaults to DEFAULT_PACK_NAME. :type pack_name: str, optional :param serialiser_type: The serialiser type. Defaults to 'dill'. :type serialiser_type: Union[str, AvailableSerialisers], optional :param make_archive: Whether to make the archive (.zip file). Defaults to True.
:type make_archive: bool :param only_archive: Whether to clear the non-compressed folder. Defaults to False. :type only_archive: bool :param add_hash_to_pack_name: Whether to add the hash to the pack name. This is only relevant if pack_name is specified. Defaults to True. :type add_hash_to_pack_name: bool :param change_description: If provided, this description will be added to the model description. Defaults to None. :type change_description: Optional[str] :Returns: **str** -- The final model pack path. .. py:method:: _get_hash() .. py:method:: _versioning(change_description) .. py:method:: attempt_unpack(zip_path) :classmethod: Attempt to unpack the zip to a folder and get the model pack path. If the folder already exists, no unpacking is done. :param zip_path: The ZIP path. :type zip_path: str :Returns: **str** -- The model pack path. .. py:method:: load_model_pack(model_pack_path) :classmethod: Load the model pack from file. :param model_pack_path: The model pack path. :type model_pack_path: str :raises ValueError: If the saved data does not represent a model pack. :Returns: **CAT** -- The loaded model pack. .. py:method:: load_cdb(model_pack_path) :classmethod: Loads the concept database from the provided model pack path. :param model_pack_path: Path to the model pack (zip or directory). :type model_pack_path: str :Returns: **CDB** -- The loaded concept database. .. py:method:: get_model_card(as_dict: Literal[True]) -> medcat.data.model_card.ModelCard get_model_card(as_dict: Literal[False]) -> str Get the model card either as a (nested) `dict` or as a JSON string. :param as_dict: Whether to return as dict. Defaults to False. :type as_dict: bool :Returns: **Union[str, ModelCard]** -- The model card. .. py:method:: __eq__(other) .. py:method:: add_addon(addon) .. py:method:: get_strategy() .. py:method:: include_properties() :classmethod: .. py:class:: RegressionSuite(cases, metadata, name) The regression checker.
This is used to check a bunch of regression cases at once against a model. :param cases: The list of regression cases :type cases: list[RegressionCase] :param metadata: The metadata for the regression suite :type metadata: MetaData :param use_report: Whether or not to use the report functionality. Defaults to False. :type use_report: bool .. py:method:: __init__(cases, metadata, name) .. py:attribute:: cases :type: list[RegressionCase] .. py:attribute:: report .. py:attribute:: metadata .. py:method:: get_all_distinct_cases(translation, edit_distance, use_diacritics) Gets all the distinct cases for this regression suite. While distinct cases can be determined without the translation layer, including it here simplifies the process. :param translation: The translation layer. :type translation: TranslationLayer :param edit_distance: The edit distance(s) to try. Defaults to (0, 0, 0). :type edit_distance: tuple[int, int, int] :param use_diacritics: Whether to use diacritics for edit distance. :type use_diacritics: bool :Yields: *Iterator[tuple[RegressionCase, Iterator[FinalTarget]]]* -- The generator of the regression case along with its corresponding sub-cases. .. py:method:: estimate_total_distinct_cases() .. py:method:: iter_subcases(translation, show_progress = True, edit_distance = (0, 0, 0), use_diacritics = False) Iterate over all the sub-cases. Each sub-case presents a unique target (phrase, concept, name) on the corresponding regression case. :param translation: The translation layer. :type translation: TranslationLayer :param show_progress: Whether to show progress. Defaults to True. :type show_progress: bool :param edit_distance: The edit distance(s) to try. Defaults to (0, 0, 0). :type edit_distance: tuple[int, int, int] :param use_diacritics: Whether to use diacritics for edit distance. :type use_diacritics: bool :Yields: *Iterator[tuple[RegressionCase, FinalTarget]]* -- The generator of the regression case along with each of the final target sub-cases. ..
py:method:: check_model(cat, translation, edit_distance = (0, 0, 0), use_diacritics = False) Checks the model and generates a report. :param cat: The model to check against :type cat: CAT :param translation: The translation layer :type translation: TranslationLayer :param edit_distance: The edit distance of the names. Defaults to (0, 0, 0). :type edit_distance: tuple[int, int, int] :param use_diacritics: Whether to use diacritics for edit distance. :type use_diacritics: bool :Returns: **MultiDescriptor** -- A report description .. py:method:: __str__() .. py:method:: __repr__() .. py:method:: to_dict() Converts the RegressionChecker to dict for serialisation. :Returns: **dict** -- The dict representation .. py:method:: to_yaml() Convert the RegressionChecker to YAML string. :Returns: **str** -- The YAML representation .. py:method:: __eq__(other) .. py:method:: from_dict(in_dict, name) :classmethod: Construct a RegressionChecker from a dict. Most of the parsing is handled in RegressionChecker.from_dict. This just assumes that each key in the dict is a name and each value describes a RegressionCase. :param in_dict: The input dict. :type in_dict: dict :param name: The name of the regression suite. :type name: str :Returns: **RegressionChecker** -- The built regression checker .. py:method:: from_yaml(file_name) :classmethod: Constructs a RegressionChecker from a YAML file. The from_dict method is used for the construction from the dict. :param file_name: The file name :type file_name: str :Returns: **RegressionChecker** -- The constructed regression checker .. py:method:: from_mct_export(file_name) :classmethod: .. py:class:: TranslationLayer(cui2info, name2info, cui2children, separator, whitespace = ' ') The translation layer for translating: - CUIs to names - names to CUIs - type_ids to CUIs - CUIs to child CUIs The idea is to decouple these translations from the CDB instance in case something changes there.
:param cui2info: The map from CUI to names :type cui2info: dict[str, CUIInfo] :param name2info: The map from name to CUIs :type name2info: dict[str, NameInfo] :param cui2type_ids: The map from CUI to type_ids :type cui2type_ids: dict[str, set[str]] :param cui2children: The map from CUI to child CUIs :type cui2children: dict[str, set[str]] .. py:method:: __init__(cui2info, name2info, cui2children, separator, whitespace = ' ') .. py:attribute:: cui2info .. py:attribute:: name2info .. py:attribute:: separator .. py:attribute:: whitespace :value: ' ' .. py:attribute:: type_id2cuis :type: dict[str, set[str]] .. py:attribute:: cui2children .. py:method:: get_names_of(cui, only_prefnames) Get the preprocessed names of a CUI. This method preprocesses the names by replacing the separator (generally `~`) with the appropriate whitespace (` `). If the concept is not in the underlying CDB, an empty list is returned. :param cui: The concept in question. :type cui: str :param only_prefnames: Whether to only return a preferred name. :type only_prefnames: bool :Returns: **list[str]** -- The list of names. .. py:method:: get_preferred_name(cui) Get the preferred name of a concept. If no preferred name is found, an arbitrary 'first' name is selected. :param cui: The concept ID. :type cui: str :Returns: **str** -- The preferred name. .. py:method:: get_first_name(cui) Get the preprocessed (potentially arbitrary) first name of the given concept. If the concept does not exist, the CUI itself is returned. PS: The "first" name may not be consistent across runs since it relies on set order. :param cui: The concept ID. :type cui: str :Returns: **str** -- The first name. .. py:method:: get_direct_children(cui) Get the direct children of a concept. This means only the children, but not grandchildren. If the underlying CDB doesn't list children for this CUI, an empty list is returned. :param cui: The concept in question.
:type cui: str :Returns: **list[str]** -- The (potentially empty) list of direct children. .. py:method:: get_direct_parents(cui) Get the direct parent(s) of a concept. PS: This method can be quite CPU-heavy since it relies on running through all the parent-children relationships, because the child->parent(s) relationship isn't normally kept track of. :param cui: _description_ :type cui: str :Returns: **list[str]** -- _description_ .. py:method:: get_children_of(found_cuis, cui, depth = 1) Get the children of the specified CUI in the listed CUIs (if they exist). :param found_cuis: The list of CUIs to look in :type found_cuis: Iterable[str] :param cui: The target parent CUI :type cui: str :param depth: The depth to carry out the search for :type depth: int :Returns: **list[str]** -- The list of children found .. py:method:: from_CDB(cdb) :classmethod: Construct a TranslationLayer object from a concept database (CDB). This translation layer will refer to the same dicts that the CDB refers to. While there is no obvious reason these should be modified, it's something to keep in mind. :param cdb: The CDB :type cdb: CDB :Returns: **TranslationLayer** -- The subsequent TranslationLayer .. py:class:: Strictness Bases: :py:obj:`enum.Enum` The total strictness on which to judge the results. .. py:attribute:: STRICTEST The strictest option which only allows identical findings. .. py:attribute:: STRICT A strict option which allows identical or children. .. py:attribute:: NORMAL Normal strictness also allows partial overlaps on target concept and children. .. py:attribute:: LENIENT Lenient strictness also allows parents and grandparents. .. py:attribute:: ANYTHING Anything strictness allows ANY finding. This would generally only be relevant when disabling examples for results descriptors. .. py:method:: __new__(value) .. py:method:: _generate_next_value_(start, count, last_values) Generate the next value when not given.
name: the name of the member start: the initial start value or None count: the number of existing members last_value: the last value assigned or None .. py:method:: _missing_(value) :classmethod: .. py:method:: __repr__() .. py:method:: __str__() .. py:method:: __dir__() Returns all members and all public methods .. py:method:: __format__(format_spec) Returns format using actual value type unless __str__ has been overridden. .. py:method:: __hash__() .. py:method:: __reduce_ex__(proto) .. py:method:: name() The name of the Enum member. .. py:method:: value() The value of the Enum member. .. py:class:: Finding Bases: :py:obj:`enum.Enum` Describes whether or how the finding was verified. The idea is that we know where we expect the entity to be recognised and the enum constants describe how the recognition compared to the expectation. In essence, we want to know the relative positions of the two pairs of numbers (character numbers): - Expected Start, Expected End - Recognised Start, Recognised End We can model this as 4 numbers on the number line. And we want to know their position relative to each other. For example, if the expected positions are marked with * and recognised positions with #, we may have something like: ___*__#_______#*______________ Which would indicate that there is a partial, but smaller span recognised. .. py:attribute:: IDENTICAL The CUI and the span recognised are identical to what was expected. .. py:attribute:: BIGGER_SPAN_RIGHT The CUI is the same, but the recognised span is longer on the right. If we use the notation from the class doc string, e.g: _*#__*__# .. py:attribute:: BIGGER_SPAN_LEFT The CUI is the same, but the recognised span is longer on the left. If we use the notation from the class doc string, e.g: _#_*__*#_ .. py:attribute:: BIGGER_SPAN_BOTH The CUI is the same, but the recognised span is longer on both sides. If we use the notation from the class doc string, e.g: _#__*__*__#_ ..
py:attribute:: SMALLER_SPAN The CUI is the same, but the recognised span is smaller. If we use the notation from the class doc string, e.g: _*_#_#_*_ (neither start nor end match) _*#_#_*__ (start matches, but end is before expected) _*__#_#*_ (end matches, but start is after expected) .. py:attribute:: PARTIAL_OVERLAP The CUI is the same, but the span overlaps partially. If we use the notation from the class doc string, e.g: _*_#__*_#_ (starts between expected start and end, but ends beyond) _#_*_#_*__ (start before expected start, but ends between expected start and end) .. py:attribute:: FOUND_DIR_PARENT The recognised CUI is a parent of the expected CUI but the span is an exact match. .. py:attribute:: FOUND_DIR_GRANDPARENT The recognised CUI is a grandparent of the expected CUI but the span is an exact match. .. py:attribute:: FOUND_ANY_CHILD The recognised CUI is a child of the expected CUI but the span is an exact match. .. py:attribute:: FOUND_CHILD_PARTIAL The recognised CUI is a child yet the match is only partial (smaller/bigger/partial). .. py:attribute:: FOUND_OTHER Found another CUI in the same span. .. py:attribute:: FAIL The concept was not recognised in any meaningful way. .. py:method:: has_correct_cui() Whether the finding found the correct concept. :Returns: **bool** -- Whether the correct concept was found. .. py:method:: determine(exp_cui, exp_start, exp_end, tl, found_entities, strict_only = False, check_children = True, check_parent = True, check_grandparent = True) :classmethod: Determine the finding type based on the input :param exp_cui: Expected CUI. :type exp_cui: str :param exp_start: Expected span start. :type exp_start: int :param exp_end: Expected span end. :type exp_end: int :param tl: The translation layer. :type tl: TranslationLayer :param found_entities: The entities found by the model. :type found_entities: dict[int, Entity] :param strict_only: Whether to use a strict-only mode (either identical or fail). Defaults to False. 
:type strict_only: bool :param check_children: Whether to check the children. Defaults to True. :type check_children: bool :param check_parent: Whether to check for parent(s). Defaults to True. :type check_parent: bool :param check_grandparent: Whether to check for grandparent(s). Defaults to True. :type check_grandparent: bool :Returns: **tuple['Finding', Optional[str]]** -- The type of finding determined, and the alternative. .. py:method:: __new__(value) .. py:method:: _generate_next_value_(start, count, last_values) Generate the next value when not given. name: the name of the member start: the initial start value or None count: the number of existing members last_value: the last value assigned or None .. py:method:: _missing_(value) :classmethod: .. py:method:: __repr__() .. py:method:: __str__() .. py:method:: __dir__() Returns all members and all public methods .. py:method:: __format__(format_spec) Returns format using actual value type unless __str__ has been overridden. .. py:method:: __hash__() .. py:method:: __reduce_ex__(proto) .. py:method:: name() The name of the Enum member. .. py:method:: value() The value of the Enum member. .. py:data:: STRICTNESS_MATRIX :type: dict[Strictness, set[Finding]] .. py:data:: logger .. py:data:: DEFAULT_TEST_SUITE_PATH .. py:function:: show_description() .. py:function:: main(model_pack_dir, test_suite_file, phrases = False, hide_empty = False, examples_strictness_str = 'STRICTEST', jsonpath = None, overwrite = False, jsonindent = None, strictness_str = 'NORMAL', max_phrase_length = 80, use_mct_export = False, mct_export_yaml_path = None, only_mct_export_conversion = False, only_describe = False, require_fully_correct = False, edit_distance = (0, 0, 0)) Check test suite against the specified model pack.
:param model_pack_dir: The path to the model pack :type model_pack_dir: Path :param test_suite_file: The path to the test suite YAML :type test_suite_file: Path :param phrases: Whether to show per-phrase information in a report :type phrases: bool :param hide_empty: Whether to hide empty cases in a report :type hide_empty: bool :param examples_strictness_str: The example strictness string. Defaults to STRICTEST. NOTE: If you set this to 'None', examples will be omitted. :type examples_strictness_str: str :param jsonpath: The json path to save the report to (if specified) :type jsonpath: Optional[Path] :param overwrite: Whether to overwrite the file if it exists. Defaults to False :type overwrite: bool :param jsonindent: The indentation for json objects. Defaults to 0 :type jsonindent: int :param strictness_str: The strictness name. Defaults to NORMAL. :type strictness_str: str :param max_phrase_length: The maximum phrase length in examples. Defaults to 80. :type max_phrase_length: int :param use_mct_export: Whether to use a MedCATtrainer export as input. Defaults to False. :type use_mct_export: bool :param mct_export_yaml_path: The (optional) path the converted MCT export should be saved as YAML at. If not set (or None), the MCT export is not saved in YAML format. Defaults to None. :type mct_export_yaml_path: str :param only_mct_export_conversion: Whether to only deal with the MCT export conversion. I.e exit when MCT export conversion is done. Defaults to False. :type only_mct_export_conversion: bool :param only_describe: Whether to only describe the finding options and exit. Defaults to False. :type only_describe: bool :param require_fully_correct: Whether all cases are required to be correct. If set to True, an exit-status of 1 is returned unless all (sub)cases are correct. Defaults to False. :type require_fully_correct: bool :param edit_distance: The edit distance, the random seed, and the number of edited names to pick for each of the names. 
If set to non-0, the specified number of splits, deletes, transposes, replaces, or inserts are done to each name. This can be useful for looking at the capability of identifying typos in text. However, this can make the process a lot slower as a result. Defaults to (0, 0, 0). :type edit_distance: tuple[int, int, int] :raises ValueError: If unable to overwrite file or folder does not exist. .. py:function:: tuple3_parser(arg) .. py:data:: parser
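The span-related members of ``Finding`` (``IDENTICAL``, ``BIGGER_SPAN_*``, ``SMALLER_SPAN``, ``PARTIAL_OVERLAP``) are described above in terms of four positions on a number line. The following is a minimal, self-contained sketch of that comparison only; it is not the implementation of ``Finding.determine``. ``SpanRelation`` and ``classify_span`` are hypothetical names introduced for illustration, the CUI is assumed to match already, and span ends are assumed to be exclusive, so spans that merely touch are treated as not overlapping.

```python
from enum import Enum, auto


class SpanRelation(Enum):
    """Hypothetical stand-in for the span-related ``Finding`` members."""
    IDENTICAL = auto()
    BIGGER_SPAN_RIGHT = auto()
    BIGGER_SPAN_LEFT = auto()
    BIGGER_SPAN_BOTH = auto()
    SMALLER_SPAN = auto()
    PARTIAL_OVERLAP = auto()
    FAIL = auto()


def classify_span(exp_start: int, exp_end: int,
                  start: int, end: int) -> SpanRelation:
    """Compare a recognised span to the expected one (same CUI assumed)."""
    if start == exp_start and end == exp_end:
        return SpanRelation.IDENTICAL
    # Assumption: ends are exclusive, so touching spans do not overlap.
    if start >= exp_end or end <= exp_start:
        return SpanRelation.FAIL
    # Recognised span lies fully inside the expected span.
    if start >= exp_start and end <= exp_end:
        return SpanRelation.SMALLER_SPAN
    # From here on, the recognised span sticks out on at least one side.
    if start == exp_start:
        return SpanRelation.BIGGER_SPAN_RIGHT
    if end == exp_end:
        return SpanRelation.BIGGER_SPAN_LEFT
    if start < exp_start and end > exp_end:
        return SpanRelation.BIGGER_SPAN_BOTH
    # Sticks out on one side while falling short on the other.
    return SpanRelation.PARTIAL_OVERLAP


# Expected span is characters 10..20 in each call.
print(classify_span(10, 20, 10, 25))
print(classify_span(10, 20, 12, 18))
print(classify_span(10, 20, 15, 25))
```

The real ``Finding.determine`` additionally takes the expected CUI, the translation layer and the found entities into account (for example to report ``FOUND_ANY_CHILD`` or ``FOUND_DIR_PARENT``), which this sketch deliberately leaves out.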