emm.data package¶

Submodules¶

emm.data.create_data module¶

emm.data.create_data.create_example_noised_names(noise_level=0.3, noise_type='all', random_seed=1)¶

Create example noised dataset based on company names from kvk.

The kvk.csv dataset is a sample of an open dataset from the Dutch Chamber of Commerce. Open source: https://www.kvk.nl/download/LEI_Full_tcm109-377398.csv (the relevant column ‘registeredName’ is already extracted and saved as kvk.csv).

Args:

noise_level: float with probability (0.0 < x < 1.0) of adding noise to a name
noise_type: noise type, default is “all”
random_seed: seed to use

Returns:

ground_truth and noised names, both pandas dataframes
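To make the role of noise_level concrete, here is a minimal, self-contained sketch that does not use emm. It assumes that each name is noised independently with probability noise_level. The helpers drop_random_letter and noise_names are hypothetical stand-ins for illustration, not the library's actual noise operations.

```python
from __future__ import annotations

import random


def drop_random_letter(name: str, rng: random.Random) -> str:
    """Hypothetical noise operation: remove one randomly chosen character."""
    if len(name) < 2:
        return name
    pos = rng.randrange(len(name))
    return name[:pos] + name[pos + 1:]


def noise_names(names: list[str], noise_level: float = 0.3, random_seed: int = 1) -> list[str]:
    """Noise each name with probability noise_level (assumed semantics)."""
    rng = random.Random(random_seed)
    noised = []
    for name in names:
        if rng.random() < noise_level:
            # This name is selected for noising.
            noised.append(drop_random_letter(name, rng))
        else:
            # This name is kept unchanged.
            noised.append(name)
    return noised


names = ["Tzu Sun B.V.", "Eddie Eagle Holding", "Acme Trading"]
print(noise_names(names, noise_level=0.3, random_seed=1))
```

With a fixed random_seed the same names are noised in the same way on every run, which is what makes such a dataset usable in tests.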

emm.data.create_data.create_noised_data(spark, noise_level=0.3, noise_type='all', noise_count=1, split_pos_neg=True, data_path=None, name_col='Name', index_col='Index', ret_posneg=False, random_seed=None, positive_set_col='positive_set')¶

Create spark noised dataset based on company names from kvk.

Source: https://www.kvk.nl/download/LEI_Full_tcm109-377398.csv (the relevant column ‘registeredName’ is already extracted and saved as kvk.csv).

Args:

spark: the spark session
noise_level: float with probability (0.0 < x < 1.0) of adding noise to a name
noise_type: noise type, default is “all”
noise_count: integer number of noised names to create per original name. default is 1.
split_pos_neg: randomly split the dataset into positive and negative set
data_path: path of input csv file
name_col: name column in csv file
index_col: name-id column in csv file (optional)
ret_posneg: if true also return original positive and negative spark true datasets
random_seed: seed to use
positive_set_col: name of positive set column in csv file, default is “positive_set”.

Returns:

ground_truth and companies_noised_pd spark dataframes

emm.data.create_data.create_training_data()¶

Return type: tuple[DataFrame, Vocabulary]

emm.data.create_data.pandas_create_noised_data(noise_level=0.3, noise_type='all', noise_count=1, split_pos_neg=True, data_path=None, name_col='Name', index_col='Index', random_seed=None, positive_set_col='positive_set')¶

Create pandas noised dataset based on company names from kvk.

Source: https://www.kvk.nl/download/LEI_Full_tcm109-377398.csv (the relevant column ‘registeredName’ is already extracted and saved as kvk.csv).

Args:

noise_level: float with probability (0.0 < x < 1.0) of adding noise to a name
noise_type: noise type, default is “all”
noise_count: integer number of noised names to create per original name. default is 1.
split_pos_neg: randomly split the dataset into positive and negative set
data_path: path of input csv file
name_col: name column in csv file
index_col: name-id column in csv file (optional)
random_seed: seed to use
positive_set_col: name of positive set column in csv file, default is “positive_set”.

Returns:

ground_truth and companies_noised_pd pandas dataframes

emm.data.create_data.pandas_split_data(data_path=None, name_col='Name', index_col='Index')¶

Split pandas dataset based on duplicate company ids

Args:

data_path: path of input csv file
name_col: name column in csv file
index_col: name-id column in csv file (optional)

Returns:

ground_truth and negative pandas dataframes

emm.data.create_data.retrieve_kvk_test_sample(url='https://web.archive.org/web/20140225151639if_/http://www.kvk.nl/download/LEI_Full_tcm109-377398.csv', n=6800, random_state=42, store_local=True, ignore_local=False, use_columns=['registeredName', 'legalEntityIdentifier'])¶

Get sample of the complete kvk data for unit testing

For testing and demoing we only need a small subset (470 kB) of the complete kvk dataset.

Args:

url: location to download the data from
n: number of data records from complete kvk dataset, up to maximum of 6800. default is 6800.
random_state: seed to use
store_local: store downloaded kvk file locally, default is true.
ignore_local: ignore local file, default is false.
use_columns: subset of columns to use

Returns:

tuple of path and sample kvk dataframe

Parameters:

url (str)
n (int)
random_state (int)
store_local (bool)
ignore_local (bool)
use_columns (list)

emm.data.create_data.split_data(spark, data_path=None, name_col='Name', index_col='Index')¶

Split dataset into ground truth and negative set based on duplicate company ids

Args:

spark: the spark session
data_path: path of input csv file
name_col: name column in csv file
index_col: name-id column in csv file (optional)

Returns:

ground_truth and negative spark dataframes

emm.data.negative_data_creation module¶

emm.data.negative_data_creation.create_positive_negative_samples(df, uid_col='uid', correct_col='correct', positive_set_col='positive_set', pattern_rank_col='rank_*')¶

Create negative and (consistent) positive datasets from a single positive names dataset

Create a negative name-pairs dataset from a positive name-pairs dataset after it has passed through cosine similarity and/or SNI indexers. Effectively we create a negative names dataset from about half of the input data, where the maximum rank gets reduced by one unit compared with the input positive names dataset. The other half (the positive names) are also reduced in rank-window accordingly.

These are the steps taken for the negative names:

Positive correct name-pairs are removed.
Rerank the remaining candidates of a name-to-match.
Remove any remaining candidates with the highest rank. This is needed in cases where no positive correct pair was present.

Args:

df: input positive names dataframe, which is the output of cosine similarity and/or SNI indexers, from which the negative names dataframe is created.
uid_col: name of uid column. default is ‘uid’.
correct_col: name of correct-match column. default is ‘correct’.
positive_set_col: name of column that indicates which names-to-match go to the positive (and negative) name pair datasets. default is ‘positive_set’.
pattern_rank_col: pattern used to search for rank columns. Each rank column corresponds to an indexer. default is the pattern ‘rank_*’.

Returns:

the created, merged negative plus positive name-pairs dataset

Parameters:

df (DataFrame)
uid_col (str)
correct_col (str)
positive_set_col (str)
pattern_rank_col (str)

emm.data.negative_data_creation.merge_indexers(df, indexers, rank_cols)¶

Merging of indexer datasets after the reranking

Args:

df: input positive names dataframe, which is the output of cosine similarity and/or SNI indexers, from which the negative names dataframe is created.
indexers: indexer datasets after the reranking, will overwrite original input dataset.
rank_cols: list with rank columns to overwrite.

Returns:

merged dataset of indexer datasets after the reranking

Parameters:

df (DataFrame)
indexers (list)
rank_cols (list)

emm.data.negative_data_creation.negative_rerank_cossim(indexer_df, rank_col, rank_max, uid_col='uid', correct_col='correct')¶

Reorder the rank column in negative dataset of cosine similarity indexer

Create a negative name-pairs dataset from a positive name-pairs dataset after it has passed through the cosine similarity indexer. Effectively we create a negative names dataset where the maximum rank has been reduced by one unit compared with the positive names dataset. These are the steps taken:

Positive correct name-pairs are removed.
Rerank the remaining candidates of a name-to-match.
Remove any remaining candidates with the highest rank. This is needed in cases where no positive correct pair was present.

Args:

indexer_df: input positive names dataframe, which is the output of a cosine similarity indexer, from which the negative names dataframe is created.
rank_col: name of rank column to reorder.
rank_max: only rank values lower than this value are kept, after reranking.
uid_col: name of uid column. default is ‘uid’.
correct_col: name of correct-match column. default is ‘correct’.

Returns:

the created negative names dataset

Parameters:

rank_col (str)
uid_col (str)
correct_col (str)
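The three steps listed for negative_rerank_cossim can be sketched on plain Python dictionaries as follows. This is an illustration of the described procedure, not the library's implementation: the function negative_rerank is hypothetical, and it assumes that ranks start at 1, that correct is a boolean flag, and that candidates are grouped per uid.

```python
from __future__ import annotations


def negative_rerank(candidates: list[dict], rank_max: int) -> list[dict]:
    """Turn ranked positive candidates into negative candidates (illustrative only)."""
    # Step 1: remove the positive correct name-pairs.
    remaining = [c for c in candidates if not c["correct"]]

    # Step 2: rerank the remaining candidates of each name-to-match (uid),
    # keeping their relative order and starting again at rank 1 (assumption).
    by_uid: dict = {}
    for cand in sorted(remaining, key=lambda c: (c["uid"], c["rank"])):
        by_uid.setdefault(cand["uid"], []).append(cand)
    reranked = [
        {**cand, "rank": new_rank}
        for group in by_uid.values()
        for new_rank, cand in enumerate(group, start=1)
    ]

    # Step 3: keep only ranks lower than rank_max. This drops the last
    # candidate of a uid that had no correct pair to remove in step 1.
    return [c for c in reranked if c["rank"] < rank_max]


candidates = [
    {"uid": "a", "gt_uid": "g1", "rank": 1, "correct": True},
    {"uid": "a", "gt_uid": "g2", "rank": 2, "correct": False},
    {"uid": "a", "gt_uid": "g3", "rank": 3, "correct": False},
    {"uid": "b", "gt_uid": "g4", "rank": 1, "correct": False},
    {"uid": "b", "gt_uid": "g5", "rank": 2, "correct": False},
    {"uid": "b", "gt_uid": "g6", "rank": 3, "correct": False},
]
negatives = negative_rerank(candidates, rank_max=3)
for row in negatives:
    print(row["uid"], row["gt_uid"], row["rank"])
```

In this example name "a" loses its correct candidate and the two others move up one rank, while name "b" had no correct candidate and loses its rank-3 candidate instead. Both names end up with two candidates, so the maximum rank is one lower than in the input.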

emm.data.negative_data_creation.negative_rerank_sni(indexer_df, rank_col, rank_max, uid_col='uid', correct_col='correct')¶

Reorder the rank column in negative dataset of SNI indexer

Create a negative name-pairs dataset from a positive name-pairs dataset after it has passed through the SNI indexer. Effectively we create a negative names dataset where the maximum rank has been reduced by one unit compared with the positive names dataset. These are the steps taken:

Positive correct name-pairs are removed.
Rerank the remaining, relevant SNI candidates of a name-to-match.
Remove any remaining candidates with the highest rank. This is needed in cases where no positive correct pair was present.

Args:

indexer_df: input positive names dataframe, which is the output of an SNI indexer, from which the negative names dataframe is created.
rank_col: name of rank column to reorder.
rank_max: only (absolute) rank values lower than this value are kept, after reranking.
uid_col: name of uid column. default is ‘uid’.
correct_col: name of correct-match column. default is ‘correct’.

Returns:

the created negative names dataset

emm.data.noiser module¶

class emm.data.noiser.Noiser(insert_vocabulary=None, noise_threshold=0.3, noise_type='all', seed=1)¶

Bases: object

Parameters:

insert_vocabulary (Optional[list[str]])
noise_threshold (float)
noise_type (str)
seed (int)

abbreviate(name)¶

change_letter(word)¶

change_word(name)¶

cut_word(name)¶

drop_letter(word)¶

drop_word(name)¶

insert_letter(word)¶

insert_word(name)¶

merge_words(name)¶

noise(name)¶

split_word(name)¶

swap_letter(word)¶

swap_words(name)¶

emm.data.noiser.create_noiser(names, noise_level, noise_type, random_seed=None)¶

Creates a suitable Noiser class

emm.data.prepare_name_pairs module¶

emm.data.prepare_name_pairs.prepare_name_pairs(candidates, **kwargs)¶

emm.data.prepare_name_pairs.prepare_name_pairs_pd(candidates_pd, drop_duplicate_candidates=False, drop_samename_nomatch=False, create_negative_sample_fraction=0, entity_id_col='entity_id', gt_entity_id_col='gt_entity_id', positive_set_col='positive_set', correct_col='correct', uid_col='uid', gt_uid_col='gt_uid', preprocessed_col='preprocessed', gt_preprocessed_col='gt_preprocessed', random_seed=42)¶

Prepare dataset of name-pair candidates for training of supervised model.

This function is used inside em_model.create_training_name_pairs().

The input consists of the name-pair candidates created there. In particular, that function creates name-pairs for training from positive names that match the ground truth.

Positive names are names that are supposed to match to the ground truth. A fraction of the positive names can be converted to negative names, which are not supposed to match to the ground truth.

The creation of negative names drops the negative correct candidates and reranks the remaining negative candidates.

Args:

candidates_pd: input positive name-pair candidates created at em_model.create_training_name_pairs().
drop_duplicate_candidates: if True, drop any duplicate training candidates and keep just one; if available, keep the correct match. Recommended for string-similarity models, e.g. with without_rank_features=True. default is False.
drop_samename_nomatch: if True, drop any candidate name-pairs where the two names are equal but which are not a match. default is False.
create_negative_sample_fraction: fraction of name-pairs converted to negative name-pairs. A negative name is guaranteed to have no match to any name in the ground truth. default is 0: no negative names are created.
entity_id_col: entity id column of names to match, default is “entity_id”. For matching name-pairs entity_id == gt_entity_id.
gt_entity_id_col: entity id column of ground-truth names, default is “gt_entity_id”. For matching name-pairs entity_id == gt_entity_id.
positive_set_col: column that specifies which candidates remain positive and which become negative, default is “positive_set”.
correct_col: column that indicates a correct match, default is “correct”. For entity_id == gt_entity_id the column value is “correct”.
uid_col: uid column for names to match, default is “uid”.
gt_uid_col: uid column of ground-truth names, default is “gt_uid”.
preprocessed_col: name of the preprocessed names column, default is “preprocessed”.
gt_preprocessed_col: name of the preprocessed ground-truth names column, default is “gt_preprocessed”.
random_seed: random seed for selection of negative names, default is 42.
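The behaviour described for drop_duplicate_candidates (keep one candidate per duplicate, preferring the correct match) can be sketched as below. This is a sketch under stated assumptions, not the function's actual code: it treats two candidates as duplicates when they share the same (preprocessed, gt_preprocessed) pair, and it treats correct as a boolean flag. The helper name is reused here only for illustration.

```python
from __future__ import annotations


def drop_duplicate_candidates(candidates: list[dict]) -> list[dict]:
    """Keep one candidate per name pair, preferring a correct match (illustrative)."""
    # Put correct matches first. sorted() is stable, so the input order is
    # otherwise preserved.
    ordered = sorted(candidates, key=lambda c: not c["correct"])

    seen = set()
    kept = []
    for cand in ordered:
        # Assumed definition of a duplicate: same pair of preprocessed names.
        key = (cand["preprocessed"], cand["gt_preprocessed"])
        if key not in seen:
            seen.add(key)
            kept.append(cand)
    return kept


pairs = [
    {"preprocessed": "acme bv", "gt_preprocessed": "acme bv", "correct": False},
    {"preprocessed": "acme bv", "gt_preprocessed": "acme bv", "correct": True},
    {"preprocessed": "acme bv", "gt_preprocessed": "acme holding", "correct": False},
]
deduped = drop_duplicate_candidates(pairs)
for row in deduped:
    print(row["preprocessed"], "|", row["gt_preprocessed"], "|", row["correct"])
```

Sorting before deduplicating is one simple way to make sure the surviving row of each duplicate group is the correct match when one exists.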

Module contents¶

emm.data.create_training_data()¶

Return type: tuple[DataFrame, Vocabulary]