Breaking changes compared to v1

There are a number of breaking changes to the API compared to v1. This page attempts to list them all. If something was missed, don’t hesitate to create a PR with the addition. Note, though, that only the major API-level changes are listed.

API changes to CAT

Training

Training is now separated from the main CAT class into its own class (Trainer) and module (trainer.py). This affects the following methods (assuming cat is an instance of CAT):

| v1 method | v2 method |
|---|---|
| cat.train | cat.trainer.train_unsupervised |
| cat.train_supervised_raw | cat.trainer.train_supervised_raw |
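
As a rough sketch of what this move means for calling code, the helper below dispatches to whichever API the given object exposes. Only the method names cat.train and cat.trainer.train_unsupervised come from the table above. The helper name train_unsupervised_compat is hypothetical, and it assumes both methods accept an iterable of texts as their first positional argument, so check the actual signatures before relying on it.

```python
from typing import Any, Iterable


def train_unsupervised_compat(cat: Any, texts: Iterable[str]) -> Any:
    """Run unsupervised training on a v1- or v2-style CAT instance.

    Hypothetical helper: assumes both methods take the texts as their
    first positional argument.
    """
    trainer = getattr(cat, "trainer", None)
    if trainer is not None and hasattr(trainer, "train_unsupervised"):
        # v2 layout: training lives on the separate Trainer object
        return trainer.train_unsupervised(texts)
    # v1 layout: training is a method on CAT itself
    return cat.train(texts)
```

The same duck-typing pattern would apply to train_supervised_raw, which keeps its name and only moves onto cat.trainer.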

Model saving

| v1 method | v2 method |
|---|---|
| cat.create_model_pack | cat.save_model_pack |

Removals

These methods were removed either due to a difference in approach or due to perceived unimportance. Protected (starting with _) or private (starting with __) methods are not recorded here. If you were previously relying on behaviour provided by any of these, don’t hesitate to get in touch.

| v1 method | Reason removed |
|---|---|
| cat.train_supervised_from_json | Don’t want to be tightly coupled to a file format here |
| cat.multiprocessing_batch_char_size | There is currently only one multiprocessing method, and that is CAT.get_entities_multi_texts |
| cat.multiprocessing_batch_docs_size | There is currently only one multiprocessing method, and that is CAT.get_entities_multi_texts |
| cat.get_json | Unclear use cases |
| cat.destroy_pipe | Unclear use cases |

API changes to CDB

The CDB class is now located in the medcat.cdb.cdb module. However, it can still be imported from the package directly, same as before (from medcat.cdb import CDB).

Names and CUIs are now mapped to variables differently

Instead of the cui2<stuff> and name2<stuff> dicts, v2 provides the cui2info and name2info mappings. Each of these maps a CUI or a name to a dict that holds the per-concept or per-name information. The table below shows how to access the same things in the new version.

| v1 method | v2 method | Notes |
|---|---|---|
| cdb.cui2names[cui] | cdb.cui2info[cui]['names'] | |
| cdb.cui2snames[cui] | cdb.cui2info[cui]['subnames'] | |
| cdb.cui2count_train[cui] | cdb.cui2info[cui]['count_train'] | |
| cdb.cui2context_vectors[cui] | cdb.cui2info[cui]['context_vectors'] | |
| cdb.cui2type_ids[cui] | cdb.cui2info[cui]['type_ids'] | |
| cdb.cui2preferred_name[cui] | cdb.cui2info[cui]['preferred_name'] | |
| cdb.cui2average_confidence[cui] | cdb.cui2info[cui]['average_confidence'] | |
| cdb.name2cuis[name] | cdb.name2info[name]['per_cui_status'].keys() | There’s no need to track per CUI status (on a per name basis) and per name CUIs separately |
| cdb.name2cuis2status[name] | cdb.name2info[name]['per_cui_status'] | |
| cdb.name2count_train[name] | cdb.name2info[name]['count_train'] | |
| cdb.snames | cdb._subnames | |
| cdb.make_stats() | cdb.get_basic_info() | |
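
To make the mapping above concrete, here is a minimal sketch of two accessors that work against either layout. The function names names_for_cui and cuis_for_name are hypothetical. The objects at the bottom are plain stand-ins built with SimpleNamespace rather than real CDB instances, and the CUI, name, and status value in them are toy placeholders.

```python
from types import SimpleNamespace
from typing import Any, Set


def names_for_cui(cdb: Any, cui: str) -> Set[str]:
    """Return the names of a concept from a v1- or v2-style CDB."""
    if hasattr(cdb, "cui2info"):
        # v2: one per-concept dict holds everything about the CUI
        return set(cdb.cui2info[cui]["names"])
    # v1: a separate dict per piece of information
    return set(cdb.cui2names[cui])


def cuis_for_name(cdb: Any, name: str) -> Set[str]:
    """Return the CUIs linked to a name from a v1- or v2-style CDB."""
    if hasattr(cdb, "name2info"):
        # v2: the CUIs are the keys of the per-CUI status mapping
        return set(cdb.name2info[name]["per_cui_status"].keys())
    return set(cdb.name2cuis[name])


# Stand-ins that only mimic the mappings in the table (not real CDBs);
# "C1", the name, and the status "A" are toy placeholder values.
v1_cdb = SimpleNamespace(
    cui2names={"C1": {"kidney~failure"}},
    name2cuis={"kidney~failure": ["C1"]},
)
v2_cdb = SimpleNamespace(
    cui2info={"C1": {"names": {"kidney~failure"}}},
    name2info={"kidney~failure": {"per_cui_status": {"C1": "A"}}},
)
```

Both accessors return the same sets for v1_cdb and v2_cdb, which is the equivalence the table describes.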

API changes for Config

Some config parts have been moved around for clarity. The relocated parts are listed below. Note that the ability to use config[path] = value was also removed.

| v1 location | v2 location | Notes |
|---|---|---|
| config.linking | config.components.linking | |
| config.ner | config.components.ner | |
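
For code that stored v1 config paths as dotted strings, the relocation can be sketched as a small path translation followed by plain attribute access. Everything here is illustrative: the helper names to_v2_config_path and set_by_v1_path are hypothetical, the option name some_option is a placeholder, and the sketch assumes config sections are ordinary attribute-bearing objects.

```python
from functools import reduce
from typing import Any

# v1 top-level config sections that moved under config.components in v2
# (taken from the relocation table above)
RELOCATED_SECTIONS = {"linking", "ner"}


def to_v2_config_path(v1_path: str) -> str:
    """Translate a dotted v1 config path into its v2 equivalent."""
    head, _, _ = v1_path.partition(".")
    if head in RELOCATED_SECTIONS:
        return "components." + v1_path
    # Sections not listed in the table are returned unchanged
    return v1_path


def set_by_v1_path(config: Any, v1_path: str, value: Any) -> None:
    """Set a value via attribute access, given a dotted v1 path."""
    *parents, leaf = to_v2_config_path(v1_path).split(".")
    # Walk down the parent sections one attribute at a time
    target = reduce(getattr, parents, config)
    setattr(target, leaf, value)
```

Calling to_v2_config_path("linking.some_option") yields "components.linking.some_option", while paths outside the two relocated sections pass through untouched.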

Relocated packages / modules

Some packages and modules were relocated. The relocations are listed below.

| v1 location | v2 location | Notes |
|---|---|---|
| medcat.meta_cat | medcat.components.addons.meta_cat.meta_cat | |
| medcat.utils.meta_cat | medcat.components.addons.meta_cat | |
| medcat.config_meta_cat | medcat.config.config_meta_cat | |
| medcat.cdb_maker | medcat.model_creation.cdb_maker | |
| medcat.tokenizers.meta_cat_tokenizers | medcat.components.addons.meta_cat.mctokenizers.tokenizers | All MetaCAT stuff now here |
| medcat.rel_cat | medcat.components.addons.relation_extraction.rel_cat | All RelCAT stuff now here |
| medcat.utils.relation_extraction.* | medcat.components.addons.relation_extraction.* | |
| medcat.utils.ner.deid | medcat.components.ner.trf.deid | Most DeID stuff now here |
| medcat.utils.ner.model | medcat.components.ner.trf.model | |
| medcat.utils.ner.helpers | medcat.components.ner.trf.helpers | |
| medcat.tokenizers.transformers_ner | medcat.components.ner.trf.tokenizer | |
| medcat.ner.transformers_ner | medcat.components.ner.trf.transformers_ner | |
| medcat.datasets.transformers_ner | medcat.utils.ner.transformers_ner | |
| medcat.datasets.data_collator | medcat.utils.ner.data_collator | |
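
Code that has to run against both versions could resolve module names through a lookup built from the table above. The following is a minimal sketch under stated assumptions: the names V1_TO_V2_MODULES, V1_TO_V2_PREFIXES, to_v2_module, and import_relocated are hypothetical, and the mapping is only a subset of the table.

```python
import importlib
from types import ModuleType

# v1 module -> v2 module, copied from the relocation table (subset)
V1_TO_V2_MODULES = {
    "medcat.meta_cat": "medcat.components.addons.meta_cat.meta_cat",
    "medcat.config_meta_cat": "medcat.config.config_meta_cat",
    "medcat.cdb_maker": "medcat.model_creation.cdb_maker",
    "medcat.rel_cat": "medcat.components.addons.relation_extraction.rel_cat",
    "medcat.utils.ner.deid": "medcat.components.ner.trf.deid",
}

# Whole packages that moved, so submodules keep their relative name
V1_TO_V2_PREFIXES = {
    "medcat.utils.relation_extraction.":
        "medcat.components.addons.relation_extraction.",
}


def to_v2_module(v1_module: str) -> str:
    """Map a v1 module path to its v2 location, if it was relocated."""
    if v1_module in V1_TO_V2_MODULES:
        return V1_TO_V2_MODULES[v1_module]
    for old, new in V1_TO_V2_PREFIXES.items():
        if v1_module.startswith(old):
            return new + v1_module[len(old):]
    # Not in the mapping: assume the path is unchanged
    return v1_module


def import_relocated(v1_module: str) -> ModuleType:
    """Import by v1 name, trying the v2 location first."""
    try:
        return importlib.import_module(to_v2_module(v1_module))
    except ImportError:
        # Fall back to the v1 location (e.g. when v1 is installed)
        return importlib.import_module(v1_module)
```

For a one-off migration it is simpler to update the import statements directly. A lookup like this is mainly useful for tooling that must support both versions during a transition.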