medcat.vocab
============

.. py:module:: medcat.vocab


Attributes
----------

.. autoapisummary::

   medcat.vocab.WordDescriptor


Classes
-------

.. autoapisummary::

   medcat.vocab.AbstractSerialisable
   medcat.vocab.AvailableSerialisers
   medcat.vocab.Vocab


Functions
---------

.. autoapisummary::

   medcat.vocab.deserialise
   medcat.vocab.serialise


Module Contents
---------------

.. py:class:: AbstractSerialisable

   The abstract serialisable base class.

   This defines some common defaults.


   .. py:method:: get_strategy()


   .. py:method:: get_init_attrs()
      :classmethod:


   .. py:method:: ignore_attrs()
      :classmethod:


   .. py:method:: include_properties()
      :classmethod:


   .. py:method:: __eq__(other)


.. py:function:: deserialise(folder_path, ignore_folders_prefix = set(), ignore_folders_suffix = set(), **init_kwargs)

   Deserialise contents of a folder.

   Extra init keyword arguments can be provided if needed.
   These are generally:

   - cnf: The config relevant to the components
   - tokenizer (BaseTokenizer): The base tokenizer for the model
   - cdb (CDB): The CDB for the model
   - vocab (Vocab): The Vocab for the model
   - model_load_path (Optional[str]): The model load path,
     but not the component load path

   This method finds the serialiser to be used based on the files on disk.

   :param folder_path: The folder to deserialise.
   :type folder_path: str
   :param ignore_folders_prefix: The prefixes of folders to ignore.
   :type ignore_folders_prefix: set[str]
   :param ignore_folders_suffix: The suffixes of folders to ignore.
   :type ignore_folders_suffix: set[str]

   :Returns: **Serialisable** -- The deserialised object.


.. py:class:: AvailableSerialisers

   Bases: :py:obj:`enum.Enum`

   Describes the available serialisers.


   .. py:attribute:: dill


   .. py:attribute:: json


   .. py:method:: write_to(file_path)


   .. py:method:: from_file(file_path)
      :classmethod:


   .. py:method:: __new__(value)


   .. py:method:: _generate_next_value_(start, count, last_values)

      Generate the next value when not given.

      name: the name of the member
      start: the initial start value or None
      count: the number of existing members
      last_value: the last value assigned or None


   .. py:method:: _missing_(value)
      :classmethod:


   .. py:method:: __repr__()


   .. py:method:: __str__()


   .. py:method:: __dir__()

      Returns all members and all public methods


   .. py:method:: __format__(format_spec)

      Returns format using actual value type unless __str__ has been overridden.


   .. py:method:: __hash__()


   .. py:method:: __reduce_ex__(proto)


   .. py:method:: name()

      The name of the Enum member.


   .. py:method:: value()

      The value of the Enum member.


.. py:function:: serialise(serialiser_type, obj, target_folder, overwrite = False)

   Serialise an object based on the specified serialiser type.

   :param serialiser_type: The serialiser type.
   :type serialiser_type: Union[str, AvailableSerialisers]
   :param obj: The object to serialise.
   :type obj: Serialisable
   :param target_folder: The folder to serialise into.
   :type target_folder: str
   :param overwrite: Whether to allow overwriting. Defaults to False.
   :type overwrite: bool


.. py:data:: WordDescriptor


.. py:class:: Vocab

   Bases: :py:obj:`medcat.storage.serialisables.AbstractSerialisable`

   Vocabulary used to store word embeddings for context similarity
   calculation. Also used by the spell checker, though only to check
   whether a word is spelled correctly, not to fix the spelling.

   Properties:
       vocab (dict[str, WordDescriptor]):
           Map from word to attributes, e.g.
           ``{'house': {'vector': <np.ndarray>, 'count': <int>, ...}, ...}``
       index2word (dict[int, str]):
           Map from index to word, used for negative sampling
       vec_index2word (dict):
           Same as index2word, but only for words that have vectors


   .. py:method:: __init__()


   .. py:attribute:: vocab
      :type:  dict[str, WordDescriptor]


   .. py:attribute:: index2word
      :type:  dict[int, str]


   .. py:attribute:: vec_index2word
      :type:  dict[int, str]


   .. py:attribute:: cum_probs
      :type:  numpy.ndarray


   .. py:method:: inc_or_add(word, cnt = 1, vec = None)

      Add a word or increase its count.

      :param word: Word to be added
      :type word: str
      :param cnt: By how much the count should be increased, or what it
                  should be set to for a new word. (Default value = 1)
      :type cnt: int
      :param vec: Word vector (Default value = None)
      :type vec: Optional[np.ndarray]


   .. py:method:: remove_all_vectors()

      Remove all stored vector representations.


   .. py:method:: remove_words_below_cnt(cnt)

      Remove all words with frequency below cnt.

      :param cnt: Word count limit.
      :type cnt: int


   .. py:method:: _rebuild_index()


   .. py:method:: inc_wc(word, cnt = 1)

      Increase word count by cnt.

      :param word: For which word to increase the count
      :type word: str
      :param cnt: By how much to increase the count (Default value = 1)
      :type cnt: int


   .. py:method:: add_vec(word, vec)

      Add vector to a word.

      :param word: To which word to add the vector.
      :type word: str
      :param vec: The vector to add.
      :type vec: np.ndarray


   .. py:method:: reset_counts(cnt = 1)

      Reset the count for all words to cnt.

      :param cnt: New count for all words in the vocab. (Default value = 1)
      :type cnt: int


   .. py:method:: update_counts(tokens)

      Given a list of tokens, update counts for words in the vocab.

      :param tokens: Usually a large block of text split into tokens/words.
      :type tokens: list[str]


   .. py:method:: add_word(word, cnt = 1, vec = None, replace = True)

      Add a word to the vocabulary

      :param word: The word to be added; it should be lemmatized and lowercased
      :type word: str
      :param cnt: Count of this word in your dataset (Default value = 1)
      :type cnt: int
      :param vec: The vector representation of the word (Default value = None)
      :type vec: Optional[np.ndarray]
      :param replace: Will replace old vector representation (Default value = True)
      :type replace: bool


   .. py:method:: add_words(path, replace = True)

      Adds words to the vocab from a file. The file is required to have
      the following format (vec being optional):
      ``<word> <cnt> [<vec_space_separated>]``
      e.g.
      one line: the word house with 3 dimensional vectors
      ``house 34444 0.3232 0.123213 1.231231``

      :param path: Path to the file with words and vectors.
      :type path: str
      :param replace: Whether existing words in the vocabulary will be replaced. Defaults to True.
      :type replace: bool


   .. py:method:: init_cumsums()

      Initialise cumulative sums.

      This replaces the unigram table. Like the unigram table, this
      approach allows generating a list of indices that follows the
      probability distribution implied by the count of each word.


   .. py:method:: get_negative_samples(n = 6, ignore_punct_and_num = False)

      Get N negative samples.

      :param n: How many words to return (Default value = 6)
      :type n: int
      :param ignore_punct_and_num: Whether to ignore punctuation and numbers. Defaults to False.
      :type ignore_punct_and_num: bool

      :raises Exception: If no unigram table is present.

      :Returns: **list[int]** -- Indices for words in this vocabulary.


   .. py:method:: get_vectors(indices)


   .. py:method:: __getitem__(word)


   .. py:method:: vec(word)


   .. py:method:: count(word)


   .. py:method:: item(word)


   .. py:method:: __contains__(word)


   .. py:method:: __eq__(other)


   .. py:method:: save(save_path, serialiser = AvailableSerialisers.dill, overwrite = False)

      Save Vocab at path.

      :param save_path: The path to save at.
      :type save_path: str
      :param serialiser: The serialiser. Defaults to AvailableSerialisers.dill.
      :type serialiser: Union[str, AvailableSerialisers], optional
      :param overwrite: Whether to allow overwriting existing files. Defaults to False.
      :type overwrite: bool, optional


   .. py:method:: load(path)
      :classmethod:


   .. py:method:: get_strategy()


   .. py:method:: get_init_attrs()
      :classmethod:


   .. py:method:: ignore_attrs()
      :classmethod:


   .. py:method:: include_properties()
      :classmethod:
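
The cumulative-sum idea described under ``init_cumsums`` can be illustrated with a minimal, standard-library sketch. This is not the MedCAT implementation: the helper names ``build_cum_probs`` and ``sample_indices`` are hypothetical, the sketch assumes probabilities directly proportional to raw word counts (the library may weight counts differently), and it uses a plain list where the documented ``cum_probs`` attribute is a ``numpy.ndarray``.

```python
import bisect
import itertools
import random


def build_cum_probs(counts):
    """Return cumulative probabilities proportional to the given counts.

    Hypothetical helper for illustration; not part of medcat.vocab.
    """
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    cum = list(itertools.accumulate(c / total for c in counts))
    cum[-1] = 1.0  # guard against floating-point drift in the final entry
    return cum


def sample_indices(cum_probs, n, rng):
    """Draw n indices; index i is drawn with probability counts[i] / total.

    Each uniform draw in [0, 1) is mapped to the first cumulative
    probability that exceeds it, via binary search.
    """
    return [bisect.bisect_right(cum_probs, rng.random()) for _ in range(n)]


# Hypothetical word counts for a three-word vocabulary.
counts = [50, 30, 20]
cum_probs = build_cum_probs(counts)
samples = sample_indices(cum_probs, 10_000, random.Random(0))
```

Compared with a unigram table, which repeats each index in proportion to its count, the cumulative array needs only one entry per word, and each draw costs a binary search.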