medcat.storage.serialisers

Attributes

DEFAULT_SCHEMA_FILE

logger

SER_TYPE_FILE

MANUAL_SERIALISED_TAG

MANUAL_SERIALISED_RE

_DEF_SER

Exceptions

IllegalSchemaException

Inappropriate argument value (of correct type).

Classes

Serialisable

The base serialisable protocol.

ManualSerialisable

The base serialisable protocol.

RemappingUnpickler

Python's Unpickler extended to interpreter sessions and more types.

Serialiser

The abstract serialiser base class.

AvailableSerialisers

Describes the available serialisers.

DillSerialiser

The dill based serialiser.

Functions

get_all_serialisable_members(object)

Gets all serialisable members of an object.

load_schema(file_name)

Loads the schema for a folder of deserialisable files from the file.

save_schema(file_name, cls, init_parts)

Saves the schema of a class to the specified file.

fix_module_and_cls_name(module_name, cls_name)

get_serialiser([serialiser_type])

Get the serialiser based on the type specified.

get_serialiser_type_from_folder(folder_path)

Get the serialiser type that was used to serialise data in the folder.

get_serialiser_from_folder(folder_path)

Get the serialiser that was used to serialise the data in the folder.

serialise(serialiser_type, obj, target_folder[, overwrite])

Serialise an object based on the specified serialiser type.

deserialise(folder_path[, ignore_folders_prefix, ...])

Deserialise contents of a folder.

Module Contents

class medcat.storage.serialisers.Serialisable

Bases: Protocol

The base serialisable protocol.

get_strategy()

Get the serialisation strategy.

Returns:

SerialisingStrategy – The strategy.

Return type:

SerialisingStrategy

classmethod get_init_attrs()

Get the names of the arguments needed for init upon deserialisation.

Returns:

list[str] – The list of init arguments’ names.

Return type:

list[str]

classmethod ignore_attrs()

Get the names of attributes not to serialise.

Returns:

list[str] – The attribute names that should not be serialised.

Return type:

list[str]

classmethod include_properties()
Return type:

list[str]

__slots__ = ()
_is_protocol = True
_is_runtime_protocol = False
classmethod __init_subclass__(*args, **kwargs)
classmethod __class_getitem__(params)
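
As a rough illustration of the protocol above, the sketch below shows a plain class that provides the four documented methods. It does not import MedCAT. `StrategySketch` is a hypothetical stand-in for `SerialisingStrategy` (its real members are not listed in this section), and `SerialisableSketch` is a local mirror of the documented method names. The sketch marks its mirror as runtime-checkable purely so the structural match can be demonstrated with `isinstance`; the documented protocol has `_is_runtime_protocol = False`.

```python
from enum import Enum, auto
from typing import Protocol, runtime_checkable


class StrategySketch(Enum):
    """Hypothetical stand-in for SerialisingStrategy (real members not shown here)."""
    EXAMPLE = auto()


@runtime_checkable
class SerialisableSketch(Protocol):
    """Local mirror of the documented method names, for illustration only."""

    def get_strategy(self) -> StrategySketch: ...

    @classmethod
    def get_init_attrs(cls) -> list[str]: ...

    @classmethod
    def ignore_attrs(cls) -> list[str]: ...

    @classmethod
    def include_properties(cls) -> list[str]: ...


class Thing:
    """A plain class that satisfies the protocol structurally (no inheritance)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._cache: dict[str, int] = {}  # transient state, not worth saving

    def get_strategy(self) -> StrategySketch:
        return StrategySketch.EXAMPLE

    @classmethod
    def get_init_attrs(cls) -> list[str]:
        return ["name"]  # arguments __init__ needs upon deserialisation

    @classmethod
    def ignore_attrs(cls) -> list[str]:
        return ["_cache"]  # attributes that should not be serialised

    @classmethod
    def include_properties(cls) -> list[str]:
        return []  # no properties to include in this sketch


thing = Thing("example")
```
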
class medcat.storage.serialisers.ManualSerialisable

Bases: Serialisable, Protocol

The base serialisable protocol.

serialise_to(folder_path)

Serialise to a folder.

Parameters:

folder_path (str) – The folder to serialise to.

Return type:

None

classmethod deserialise_from(folder_path, **init_kwargs)

Deserialise from a specific path.

The init keyword arguments are generally:
  • cnf – The config relevant to the components.

  • tokenizer (BaseTokenizer) – The base tokenizer for the model.

  • cdb (CDB) – The CDB for the model.

  • vocab (Vocab) – The Vocab for the model.

  • model_load_path (Optional[str]) – The model load path, but not the component load path.

Parameters:

folder_path (str) – The path to deserialise from.

Returns:

ManualSerialisable – The deserialised object.

Return type:

ManualSerialisable
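
A minimal sketch of the `serialise_to` / `deserialise_from` pair, assuming a class that writes its own JSON file. `ManualThing`, its `FILE_NAME` and the stored fields are hypothetical and not part of MedCAT. The point is only that `deserialise_from` is a classmethod that receives the folder plus whatever init keyword arguments the caller supplies, and picks out the ones it needs.

```python
import json
import os
import tempfile
from typing import Any, Optional


class ManualThing:
    """Illustrative class that controls its own on-disk format."""

    FILE_NAME = "manual_thing.json"  # hypothetical file name for this sketch

    def __init__(self, labels: list[str], cnf: Optional[dict] = None) -> None:
        self.labels = labels
        self.cnf = cnf or {}

    def serialise_to(self, folder_path: str) -> None:
        # Write only the state this class owns; the config is supplied at load time
        os.makedirs(folder_path, exist_ok=True)
        with open(os.path.join(folder_path, self.FILE_NAME), "w") as f:
            json.dump({"labels": self.labels}, f)

    @classmethod
    def deserialise_from(cls, folder_path: str, **init_kwargs: Any) -> "ManualThing":
        with open(os.path.join(folder_path, cls.FILE_NAME)) as f:
            data = json.load(f)
        # Use the keyword arguments this class understands and ignore the rest
        return cls(data["labels"], cnf=init_kwargs.get("cnf"))


with tempfile.TemporaryDirectory() as tmp:
    ManualThing(["a", "b"]).serialise_to(tmp)
    restored = ManualThing.deserialise_from(tmp, cnf={"x": 1}, vocab=None)
```
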

get_strategy()

Get the serialisation strategy.

Returns:

SerialisingStrategy – The strategy.

Return type:

SerialisingStrategy

classmethod get_init_attrs()

Get the names of the arguments needed for init upon deserialisation.

Returns:

list[str] – The list of init arguments’ names.

Return type:

list[str]

classmethod ignore_attrs()

Get the names of attributes not to serialise.

Returns:

list[str] – The attribute names that should not be serialised.

Return type:

list[str]

classmethod include_properties()
Return type:

list[str]

__slots__ = ()
_is_protocol = True
_is_runtime_protocol = False
classmethod __init_subclass__(*args, **kwargs)
classmethod __class_getitem__(params)
medcat.storage.serialisers.get_all_serialisable_members(object)

Gets all serialisable members of an object.

This looks for public and protected members, but not private ones. It should also be able to return parts of lists and tuples. It also provides the name of each serialisable object.

Parameters:

object (Any) – The target object.

Returns:

tuple[list[tuple[Serialisable, str]], dict[str, Any]] – list of serialisable objects along with their names

Return type:

tuple[list[tuple[Serialisable, str]], dict[str, Any]]
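
The member discovery described above can be sketched as follows. This is not the MedCAT implementation. It assumes that "private" means name-mangled (double-underscore) attributes, that anything with a `get_strategy` method counts as serialisable, that list and tuple elements are named `<attribute>_<index>`, and that the returned dict holds the remaining raw attributes. All four are assumptions made for the example.

```python
from typing import Any


def is_serialisable(value: Any) -> bool:
    """Hypothetical check: treat anything with a get_strategy method as serialisable."""
    return callable(getattr(value, "get_strategy", None))


def split_members(obj: Any) -> tuple[list[tuple[Any, str]], dict[str, Any]]:
    """Split attributes into (serialisable part, name) pairs and raw attributes."""
    mangled_prefix = f"_{type(obj).__name__}__"
    parts: list[tuple[Any, str]] = []
    raw: dict[str, Any] = {}
    for name, value in vars(obj).items():
        if name.startswith("__") or name.startswith(mangled_prefix):
            continue  # private attribute: skipped in this sketch
        if is_serialisable(value):
            parts.append((value, name))
        elif isinstance(value, (list, tuple)) and value and all(map(is_serialisable, value)):
            # Assumed naming scheme for list/tuple elements: "<attr>_<index>"
            parts.extend((item, f"{name}_{i}") for i, item in enumerate(value))
        else:
            raw[name] = value
    return parts, raw


class Part:
    def get_strategy(self) -> str:
        return "sketch"


class Owner:
    def __init__(self) -> None:
        self.part = Part()                 # public, serialisable
        self._children = [Part(), Part()]  # protected list of serialisables
        self.count = 3                     # raw attribute
        self.__secret = "hidden"           # private (name-mangled), skipped


parts, raw = split_members(Owner())
names = [name for _, name in parts]
```
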

medcat.storage.serialisers.load_schema(file_name)

Loads the schema for a folder of deserialisable files from the file.

Parameters:

file_name (str) – The schema file

Returns:

tuple[str, list[str]] – The class package/name along with the parts needed for initialising.

Return type:

tuple[str, list[str]]

medcat.storage.serialisers.save_schema(file_name, cls, init_parts)

Saves the schema of a class to the specified file.

Parameters:
  • file_name (str) – The file to save to.

  • cls (Type) – The class in question

  • init_parts (list[str]) – The parts needed for initialisation.

Return type:

None
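
A schema round trip might look like the sketch below. The JSON layout (a single `"<module>.<class>"` key mapping to the list of init parts) is an assumption made for illustration; the actual file format is not documented in this section. The function names carry a `_sketch` suffix to make clear they are not the module's own functions.

```python
import json
import os
import tempfile
from collections import OrderedDict

DEFAULT_SCHEMA_FILE = ".schema.json"  # value documented for this module


def save_schema_sketch(file_name: str, cls: type, init_parts: list[str]) -> None:
    # Hypothetical layout: {"<module>.<class name>": [init parts]}
    key = f"{cls.__module__}.{cls.__qualname__}"
    with open(file_name, "w") as f:
        json.dump({key: init_parts}, f)


def load_schema_sketch(file_name: str) -> tuple[str, list[str]]:
    with open(file_name) as f:
        data = json.load(f)
    if len(data) != 1:
        # The real module has IllegalSchemaException (a ValueError subclass)
        raise ValueError(f"Expected exactly one class in schema, got {len(data)}")
    ((cls_path, init_parts),) = data.items()
    return cls_path, init_parts


with tempfile.TemporaryDirectory() as tmp:
    schema_path = os.path.join(tmp, DEFAULT_SCHEMA_FILE)
    # OrderedDict is used only because it is an importable standard-library class
    save_schema_sketch(schema_path, OrderedDict, ["cnf", "vocab"])
    loaded = load_schema_sketch(schema_path)
```
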

medcat.storage.serialisers.DEFAULT_SCHEMA_FILE = '.schema.json'
exception medcat.storage.serialisers.IllegalSchemaException(*args)

Bases: ValueError

Inappropriate argument value (of correct type).

__init__(*args)

Initialize self. See help(type(self)) for accurate signature.

class __cause__

exception cause

class __context__

exception context

__delattr__()

Implement delattr(self, name).

__dir__()

Default dir() implementation.

__eq__()

Return self==value.

__format__()

Default object formatter.

__ge__()

Return self>=value.

__getattribute__()

Return getattr(self, name).

__gt__()

Return self>value.

__hash__()

Return hash(self).

__le__()

Return self<=value.

__lt__()

Return self<value.

__ne__()

Return self!=value.

__new__()

Create and return a new object. See help(type) for accurate signature.

__reduce__()
__reduce_ex__()

Helper for pickle.

__repr__()

Return repr(self).

__setattr__()

Implement setattr(self, name, value).

__setstate__()
__sizeof__()

Size of object in memory, in bytes.

__str__()

Return str(self).

__subclasshook__()

Abstract classes can override this to customize issubclass().

This is invoked early on by abc.ABCMeta.__subclasscheck__(). It should return True, False or NotImplemented. If it returns NotImplemented, the normal algorithm is used. Otherwise, it overrides the normal algorithm (and the outcome is cached).

class __suppress_context__
class __traceback__
class args
with_traceback()

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

medcat.storage.serialisers.fix_module_and_cls_name(module_name, cls_name)
Parameters:
  • module_name (str)

  • cls_name (str)

Return type:

tuple[str, str]

class medcat.storage.serialisers.RemappingUnpickler(*args, **kwds)

Bases: dill.Unpickler

Python’s Unpickler extended to interpreter sessions and more types.

find_class(module, name)

Return an object from a specified module.

If necessary, the module will be imported. Subclasses may override this method (e.g. to restrict unpickling of arbitrary classes and functions).

This method is called whenever a class or a function object is needed. Both arguments passed are str objects.

Parameters:
  • module (str)

  • name (str)
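
The general technique of overriding `find_class` to redirect moved classes can be sketched with the standard library. Here `pickle.Unpickler` stands in for `dill.Unpickler`, and the `REMAP` table and the module name `old_pkg_sketch` are hypothetical; the mappings the real class applies are not listed in this section.

```python
import io
import pickle
from collections import OrderedDict

# Hypothetical remapping table: (old module, old name) -> (new module, new name)
REMAP = {("old_pkg_sketch", "Point"): ("collections", "OrderedDict")}


class RemappingUnpicklerSketch(pickle.Unpickler):
    """Redirects lookups for known old locations before importing."""

    def find_class(self, module: str, name: str):
        # Fall back to the requested location when no remapping is registered
        module, name = REMAP.get((module, name), (module, name))
        return super().find_class(module, name)


# Hand-written protocol-0 pickle: look up old_pkg_sketch.Point and call it
# with an empty argument tuple. The module does not exist, so a plain
# pickle.loads(payload) would fail with an import error.
payload = b"cold_pkg_sketch\nPoint\n(tR."
restored = RemappingUnpicklerSketch(io.BytesIO(payload)).load()
```

Because `find_class` is the single point where the unpickler resolves class and function names, remapping there leaves the rest of the loading logic untouched.
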

_session = False
__init__(*args, **kwds)

Initialize self. See help(type(self)) for accurate signature.

_main
_ignore = False
load()

Load a pickle.

Read a pickled object representation from the open file object given in the constructor, and return the reconstituted object hierarchy specified therein.

__delattr__()

Implement delattr(self, name).

__dir__()

Default dir() implementation.

__eq__()

Return self==value.

__format__()

Default object formatter.

__ge__()

Return self>=value.

__getattribute__()

Return getattr(self, name).

__gt__()

Return self>value.

__hash__()

Return hash(self).

__le__()

Return self<=value.

__lt__()

Return self<value.

__ne__()

Return self!=value.

__reduce__()

Helper for pickle.

__reduce_ex__()

Helper for pickle.

__repr__()

Return repr(self).

__setattr__()

Implement setattr(self, name, value).

__sizeof__()

Returns size in memory, in bytes.

__str__()

Return str(self).

class memo
class persistent_load
_buffers
_file_readline
_file_read
encoding = 'ASCII'
errors = 'strict'
proto = 0
fix_imports = True
pop_mark()
dispatch
load_proto()
load_frame()
load_persid()
load_binpersid()
load_none()
load_false()
load_true()
load_int()
load_binint()
load_binint1()
load_binint2()
load_long()
load_long1()
load_long4()
load_float()
load_binfloat()
_decode_string(value)
load_string()
load_binstring()
load_binbytes()
load_unicode()
load_binunicode()
load_binunicode8()
load_binbytes8()
load_bytearray8()
load_next_buffer()
load_readonly_buffer()
load_short_binstring()
load_short_binbytes()
load_short_binunicode()
load_tuple()
load_empty_tuple()
load_tuple1()
load_tuple2()
load_tuple3()
load_empty_list()
load_empty_dictionary()
load_empty_set()
load_frozenset()
load_list()
load_dict()
_instantiate(klass, args)
load_inst()
load_obj()
load_newobj()
load_newobj_ex()
load_global()
load_stack_global()
load_ext1()
load_ext2()
load_ext4()
get_extension(code)
load_reduce()
load_pop()
load_pop_mark()
load_dup()
load_get()
load_binget()
load_long_binget()
load_put()
load_binput()
load_long_binput()
load_memoize()
load_append()
load_appends()
load_setitem()
load_setitems()
load_additems()
load_build()
load_mark()
load_stop()
medcat.storage.serialisers.logger
medcat.storage.serialisers.SER_TYPE_FILE = '.serialised_by'
medcat.storage.serialisers.MANUAL_SERIALISED_TAG = 'MANUALLY_SERIALISED:'
medcat.storage.serialisers.MANUAL_SERIALISED_RE
class medcat.storage.serialisers.Serialiser

Bases: abc.ABC

The abstract serialiser base class.

This class is responsible for both serialising and deserialising.

RAW_FILE = 'raw_dict.dat'
abstract property ser_type: AvailableSerialisers

The serialiser type.

Return type:

AvailableSerialisers

abstract serialise(raw_parts, target_file)

Serialise the raw attributes / objects.

Parameters:
  • raw_parts (dict[str, Any]) – The raw objects to serialise.

  • target_file (str) – The file name to write to.

Return type:

None

abstract deserialise(target_file)

Deserialise data written to the specified file.

Parameters:

target_file (str) – The file to read from.

Returns:

dict[str, Any] – The deserialised raw attributes / objects.

Return type:

dict[str, Any]

classmethod get_ser_type_file(folder)
Parameters:

folder (str)

Return type:

str

save_ser_type_file(folder)

Save the serialiser type into the specified folder.

Parameters:

folder (str) – The folder to use.

Return type:

None

classmethod get_manually_serialised_path(folder)
Parameters:

folder (str)

Return type:

Optional[str]

check_ser_type(folder)

Check that the folder contains data serialised by this serialiser.

Parameters:

folder (str) – Target folder.

Raises:

TypeError – If the folder was not serialised by this serialiser.

Return type:

None

serialise_all(obj, target_folder, overwrite=False)

Serialise the entire object into the target folder.

This finds the serialisable parts (attributes) of the object and calls the same method on them recursively. It also finds the raw attributes (if any) and serialises them.

Parameters:
  • obj (Serialisable) – The object to serialise.

  • target_folder (str) – The target folder.

  • overwrite (bool) – Whether to allow overwriting. Defaults to False.

Raises:

IllegalSchemaException – If there are multiple parts with the same name or a file already exists.

Return type:

None
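
The recursion and the overwrite guard described above can be sketched as follows. This is a simplified stand-in, not the library code: `pickle` replaces the configured serialiser, sub-folders are assumed to be named after the attribute, anything with a `get_strategy` method is treated as a serialisable part, and a plain `ValueError` is raised where the real method raises `IllegalSchemaException`.

```python
import os
import pickle
import tempfile

RAW_FILE = "raw_dict.dat"  # file name documented on Serialiser


def serialise_all_sketch(obj, target_folder: str, overwrite: bool = False) -> None:
    os.makedirs(target_folder, exist_ok=True)
    raw_path = os.path.join(target_folder, RAW_FILE)
    if os.path.exists(raw_path) and not overwrite:
        raise ValueError(f"{raw_path} already exists")
    raw = {}
    for name, value in vars(obj).items():
        if hasattr(value, "get_strategy"):
            # Serialisable part: recurse into a sub-folder (assumed naming)
            serialise_all_sketch(value, os.path.join(target_folder, name), overwrite)
        else:
            raw[name] = value
    with open(raw_path, "wb") as f:
        pickle.dump(raw, f)  # raw attributes of this object only


class Leaf:
    def __init__(self) -> None:
        self.weight = 0.5

    def get_strategy(self) -> str:
        return "sketch"


class Tree:
    def __init__(self) -> None:
        self.leaf = Leaf()
        self.height = 3

    def get_strategy(self) -> str:
        return "sketch"


with tempfile.TemporaryDirectory() as tmp:
    serialise_all_sketch(Tree(), tmp)
    written = sorted(
        os.path.relpath(os.path.join(root, f), tmp)
        for root, _, files in os.walk(tmp)
        for f in files
    )
    try:
        serialise_all_sketch(Tree(), tmp)  # second write without overwrite
        refused = False
    except ValueError:
        refused = True
    serialise_all_sketch(Tree(), tmp, overwrite=True)  # allowed explicitly
```
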

classmethod deserialise_manually(folder_path, man_cls_path, **init_kwargs)
Parameters:
  • folder_path (str)

  • man_cls_path (str)

Return type:

medcat.storage.serialisables.Serialisable

deserialise_all(folder_path, ignore_folders_prefix=set(), ignore_folders_suffix=set(), **kwargs)

Deserialise contents of folder.

Additional initialisation keyword arguments can be provided if needed.

This loads both the raw attributes for this object as well as the serialisable parts / attributes recursively.

Parameters:
  • folder_path (str) – The folder path.

  • ignore_folders_prefix (set[str]) – The prefixes of folders to ignore.

  • ignore_folders_suffix (set[str]) – The suffixes of folders to ignore.

Returns:

Serialisable – The resulting object.

Return type:

medcat.storage.serialisables.Serialisable

__slots__ = ()
class medcat.storage.serialisers.AvailableSerialisers

Bases: enum.Enum

Describes the available serialisers.

dill
json
write_to(file_path)
Parameters:

file_path (str)

Return type:

None

classmethod from_file(file_path)
Parameters:

file_path (str)

Return type:

AvailableSerialisers
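
A sketch of how an enum can persist itself with a `write_to` / `from_file` pair. `SerialiserKindSketch` is a local illustration; storing the member name as plain text is an assumption, since the on-disk form is not specified in this section.

```python
import os
import tempfile
from enum import Enum, auto


class SerialiserKindSketch(Enum):
    dill = auto()
    json = auto()

    def write_to(self, file_path: str) -> None:
        with open(file_path, "w") as f:
            f.write(self.name)  # assumed on-disk form: the member name

    @classmethod
    def from_file(cls, file_path: str) -> "SerialiserKindSketch":
        with open(file_path) as f:
            return cls[f.read().strip()]  # look the member up by name


with tempfile.TemporaryDirectory() as tmp:
    marker = os.path.join(tmp, ".serialised_by")  # documented SER_TYPE_FILE value
    SerialiserKindSketch.json.write_to(marker)
    recovered = SerialiserKindSketch.from_file(marker)
```
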

__new__(value)
_generate_next_value_(start, count, last_values)

Generate the next value when not given.

  • name: the name of the member

  • start: the initial start value or None

  • count: the number of existing members

  • last_value: the last value assigned or None

classmethod _missing_(value)
__repr__()
__str__()
__dir__()

Returns all members and all public methods

__format__(format_spec)

Returns format using actual value type unless __str__ has been overridden.

__hash__()
__reduce_ex__(proto)
name()

The name of the Enum member.

value()

The value of the Enum member.

class medcat.storage.serialisers.DillSerialiser

Bases: Serialiser

The dill based serialiser.

ser_type

The serialiser type.

serialise(raw_parts, target_file)

Serialise the raw attributes / objects.

Parameters:
  • raw_parts (dict[str, Any]) – The raw objects to serialise.

  • target_file (str) – The file name to write to.

Return type:

None

deserialise(target_file)

Deserialise data written to the specified file.

Parameters:

target_file (str) – The file to read from.

Returns:

dict[str, Any] – The deserialised raw attributes / objects.

Return type:

dict[str, Any]

RAW_FILE = 'raw_dict.dat'
classmethod get_ser_type_file(folder)
Parameters:

folder (str)

Return type:

str

save_ser_type_file(folder)

Save the serialiser type into the specified folder.

Parameters:

folder (str) – The folder to use.

Return type:

None

classmethod get_manually_serialised_path(folder)
Parameters:

folder (str)

Return type:

Optional[str]

check_ser_type(folder)

Check that the folder contains data serialised by this serialiser.

Parameters:

folder (str) – Target folder.

Raises:

TypeError – If the folder was not serialised by this serialiser.

Return type:

None

serialise_all(obj, target_folder, overwrite=False)

Serialise the entire object into the target folder.

This finds the serialisable parts (attributes) of the object and calls the same method on them recursively. It also finds the raw attributes (if any) and serialises them.

Parameters:
  • obj (Serialisable) – The object to serialise.

  • target_folder (str) – The target folder.

  • overwrite (bool) – Whether to allow overwriting. Defaults to False.

Raises:

IllegalSchemaException – If there are multiple parts with the same name or a file already exists.

Return type:

None

classmethod deserialise_manually(folder_path, man_cls_path, **init_kwargs)
Parameters:
  • folder_path (str)

  • man_cls_path (str)

Return type:

medcat.storage.serialisables.Serialisable

deserialise_all(folder_path, ignore_folders_prefix=set(), ignore_folders_suffix=set(), **kwargs)

Deserialise contents of folder.

Additional initialisation keyword arguments can be provided if needed.

This loads both the raw attributes for this object as well as the serialisable parts / attributes recursively.

Parameters:
  • folder_path (str) – The folder path.

  • ignore_folders_prefix (set[str]) – The prefixes of folders to ignore.

  • ignore_folders_suffix (set[str]) – The suffixes of folders to ignore.

Returns:

Serialisable – The resulting object.

Return type:

medcat.storage.serialisables.Serialisable

__slots__ = ()
medcat.storage.serialisers._DEF_SER
medcat.storage.serialisers.get_serialiser(serialiser_type=_DEF_SER)

Get the serialiser based on the type specified.

Parameters:

serialiser_type (Union[str, AvailableSerialisers], optional) – The required type. Defaults to ‘dill’.

Raises:

ValueError – If no serialiser is found.

Returns:

Serialiser – The appropriate serialiser.

Return type:

Serialiser
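
One way to implement a lookup that accepts either a string or an enum member and raises `ValueError` for unknown types is sketched below. The registry, `KindSketch` and the two placeholder serialiser classes are hypothetical; only the accepted argument types, the `'dill'` default and the `ValueError` come from the description above.

```python
from enum import Enum, auto
from typing import Union


class KindSketch(Enum):
    dill = auto()
    json = auto()


class DillLikeSerialiser:
    ser_type = KindSketch.dill


class JsonLikeSerialiser:
    ser_type = KindSketch.json


# Hypothetical registry keyed by the enum member each class declares
_REGISTRY = {cls.ser_type: cls for cls in (DillLikeSerialiser, JsonLikeSerialiser)}


def get_serialiser_sketch(serialiser_type: Union[str, KindSketch] = "dill"):
    if isinstance(serialiser_type, str):
        try:
            serialiser_type = KindSketch[serialiser_type.lower()]
        except KeyError:
            raise ValueError(f"No serialiser found for {serialiser_type!r}") from None
    return _REGISTRY[serialiser_type]()


default_serialiser = get_serialiser_sketch()
json_serialiser = get_serialiser_sketch(KindSketch.json)
```
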

medcat.storage.serialisers.get_serialiser_type_from_folder(folder_path)

Get the serialiser type that was used to serialise data in the folder.

Parameters:

folder_path (str) – The folder in question.

Returns:

AvailableSerialisers – The serialiser type.

Return type:

AvailableSerialisers

medcat.storage.serialisers.get_serialiser_from_folder(folder_path)

Get the serialiser that was used to serialise the data in the folder.

Parameters:

folder_path (str) – The folder in question.

Returns:

Serialiser – The appropriate serialiser.

Return type:

Serialiser

medcat.storage.serialisers.serialise(serialiser_type, obj, target_folder, overwrite=False)

Serialise an object based on the specified serialiser type.

Parameters:
  • serialiser_type (Union[str, AvailableSerialisers]) – The serialiser type.

  • obj (Serialisable) – The object to serialise.

  • target_folder (str) – The folder to serialise into.

  • overwrite (bool) – Whether to allow overwriting. Defaults to False.

Return type:

None

medcat.storage.serialisers.deserialise(folder_path, ignore_folders_prefix=set(), ignore_folders_suffix=set(), **init_kwargs)

Deserialise contents of a folder.

Extra init keyword arguments can be provided if needed. These are generally:
  • cnf – The config relevant to the components.

  • tokenizer (BaseTokenizer) – The base tokenizer for the model.

  • cdb (CDB) – The CDB for the model.

  • vocab (Vocab) – The Vocab for the model.

  • model_load_path (Optional[str]) – The model load path, but not the component load path.

This method finds the serialiser to be used based on the files on disk.

Parameters:
  • folder_path (str) – The folder to deserialise.

  • ignore_folders_prefix (set[str]) – The prefixes of folders to ignore.

  • ignore_folders_suffix (set[str]) – The suffixes of folders to ignore.

Returns:

Serialisable – The deserialised object.

Return type:

medcat.storage.serialisables.Serialisable
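
The effect of `ignore_folders_prefix` and `ignore_folders_suffix` can be pictured with a small filter. This is a sketch of the assumed matching rule (a folder is skipped if its name starts with any listed prefix or ends with any listed suffix); the helper name and the example folder names are hypothetical.

```python
def should_ignore(folder_name: str, prefixes: set[str], suffixes: set[str]) -> bool:
    """Return True if the folder name matches any ignored prefix or suffix."""
    return any(folder_name.startswith(p) for p in prefixes) or any(
        folder_name.endswith(s) for s in suffixes
    )


# Hypothetical sub-folders found inside a serialised folder
folders = ["cdb", "vocab", "tmp_scratch", "addon_backup"]

kept = [
    name for name in folders
    if not should_ignore(name, prefixes={"tmp_"}, suffixes={"_backup"})
]

# With the documented defaults (empty sets) nothing is ignored
kept_by_default = [
    name for name in folders
    if not should_ignore(name, prefixes=set(), suffixes=set())
]
```
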