emm package

Subpackages

Submodules

emm.parameters module

Default parameters for Entity Matching.

emm.resources module

emm.resources.data(name)

Return the full path filename of a shipped data file.

Args:

name: The name of the data.

Returns:

The full path filename of the data.

Raises:

FileNotFoundError: If the data cannot be found.

Parameters:

name (str)

Return type:

str

emm.resources.notebook(name)

Return the full path filename of a tutorial notebook.

Args:

name: The name of the notebook.

Returns:

The full path filename of the notebook.

Raises:

FileNotFoundError: If the notebook cannot be found.

Parameters:

name (str)

Return type:

str
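
Both resource functions share one contract: resolve a name to a full path, and raise FileNotFoundError when nothing is shipped under that name. The following is a minimal standard-library sketch of that contract, not the emm implementation; the helper name `resolve_resource` and the `base_dir` argument are hypothetical.

```python
from pathlib import Path


def resolve_resource(base_dir: str, name: str) -> str:
    """Return the full path of `name` under `base_dir` (hypothetical helper).

    Mirrors the documented contract: a str path on success,
    FileNotFoundError when the file does not exist.
    """
    path = Path(base_dir) / name
    if not path.is_file():
        raise FileNotFoundError(f"Resource '{name}' not found in '{base_dir}'.")
    return str(path.resolve())
```

Callers of emm.resources.data(name) or emm.resources.notebook(name) can guard against unknown names in the same way, by catching FileNotFoundError.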

emm.version module

Module contents

class emm.PandasEntityMatching(parameters=None, supervised_models=None, name_col=None, entity_id_col=None, name_only=None, preprocessor=None, indexers=None, supervised_on=None, without_rank_features=None, with_legal_entity_forms_match=None, return_sm_features=None, supervised_model_object=None, aggregation_layer=None, aggregation_method=None, carry_on_cols=None, **kwargs)

Bases: BaseEntityMatching

Implementation of EntityMatching using Pandas.

Parameters:

parameters (Optional[dict[str, Any]])
supervised_models (Optional[Mapping[str, Any]])
name_col (Optional[str])
entity_id_col (Optional[str])
name_only (Optional[bool])
preprocessor (Optional[str])
indexers (Optional[list])
supervised_on (Optional[bool])
without_rank_features (Optional[bool])
with_legal_entity_forms_match (Optional[bool])
return_sm_features (Optional[bool])
supervised_model_object (Optional[Pipeline])
aggregation_layer (Optional[bool])
aggregation_method (Optional[Literal['mean_score', 'max_frequency_nm_score']])
carry_on_cols (Optional[list[str]])

add_aggregation_layer(account_col=None, freq_col=None, aggregation_method=None, blacklist=None, aggregation_layer=None)

Add or replace the aggregation layer in the pipeline.

Args:

account_col: column indicating which names-to-match belong together. Default is "account".
freq_col: name frequency column. Default is "counterparty_account_count_distinct".
aggregation_method: aggregation method: 'name_clustering' or 'mean_score'. Default is 'name_clustering'.
blacklist: blacklist of names to skip in clustering.
aggregation_layer: existing aggregation layer to add. Default is None, in which case one is created.

Parameters:

account_col (Optional[str])
freq_col (Optional[str])
aggregation_method (Optional[str])
blacklist (Optional[list])
aggregation_layer (Optional[BaseEntityAggregation])

Return type:

None

add_supervised_model(path=None, model=None, name_only=True, store_key='nm_score', overwrite=True, return_features=None)

Add a trained sklearn supervised model to the existing pipeline.

Args:

path: file path of the pickled sklearn pipeline. Alternatively, provide the model directly.
model: trained sklearn pipeline to add to the supervised layer.
name_only: whether the model is name-only. If False, the presence of extra features (country) is checked. Default is True.
store_key: storage key for the new sklearn supervised model. Default is 'nm_score'.
overwrite: overwrite the existing model if store_key is already used. Default is True.
return_features: bool to return supervised model features. None means default: False.

Parameters:

path (Optional[str])
model (Optional[Pipeline])
name_only (bool)
store_key (str)
overwrite (bool)
return_features (Optional[bool])

Return type:

None

create_training_name_pairs(train_positive_names_to_match, create_negative_sample_fraction=0, n_train_ids=-1, random_seed=42, drop_duplicate_candidates=None, **kwargs)

Create name-pairs for training from positive names that match to the ground truth.

Positive names are names that are supposed to match to the ground truth. A fraction of the positive names can be converted to negative names, which are not supposed to match to the ground truth.

Args:

train_positive_names_to_match: pandas dataframe of positive names to match for training. A positive name has a guaranteed match to a name in the ground truth. Two columns are needed: a name and an id (to determine the corresponding match in the ground truth).
create_negative_sample_fraction: fraction of ids converted to negative names. A negative name has guaranteed no match to any name in the ground truth. Default is 0: no negative names are created.
n_train_ids: down-sample the positive names to match, keeping only n_train_ids ids. Default is -1 (keep all).
random_seed: random seed for down-sampling of ids. Default is 42.
drop_duplicate_candidates: if True, drop any duplicate training candidates and keep just one; if available, keep the correct match. Recommended for string-similarity models, e.g. with without_rank_features=True. Default is False.
kwargs: extra keyword arguments passed on to prepare_name_pairs_pd.

Returns:

pandas dataframe with name-pair candidates to be used for training.

Parameters:

train_positive_names_to_match (DataFrame)
create_negative_sample_fraction (float)
n_train_ids (int)
random_seed (int)
drop_duplicate_candidates (Optional[bool])

Return type:

DataFrame
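
The interplay of n_train_ids, create_negative_sample_fraction and random_seed can be sketched with the standard library alone. This is an illustration of the documented semantics under assumptions, not the emm implementation: the helper `split_training_ids` is hypothetical, and how emm actually selects ids or builds negative names is not specified here.

```python
import random


def split_training_ids(ids, create_negative_sample_fraction=0.0, n_train_ids=-1, random_seed=42):
    """Down-sample ids, then mark a fraction of them as negative (illustrative only)."""
    rng = random.Random(random_seed)
    unique_ids = sorted(set(ids))
    # n_train_ids = -1 keeps all ids; otherwise keep a random subset of that size.
    if 0 <= n_train_ids < len(unique_ids):
        unique_ids = sorted(rng.sample(unique_ids, n_train_ids))
    # Convert the requested fraction of the remaining ids to negatives.
    n_negative = int(len(unique_ids) * create_negative_sample_fraction)
    negative_ids = set(rng.sample(unique_ids, n_negative))
    positive_ids = [i for i in unique_ids if i not in negative_ids]
    return positive_ids, sorted(negative_ids)


positive, negative = split_training_ids(
    range(10), create_negative_sample_fraction=0.2, n_train_ids=5
)
print(len(positive), len(negative))
```

Because the random generator is seeded, repeated calls with the same arguments return the same split, which is the purpose of random_seed in the signature above.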

decrease_window_by_one_step()

Utility function for negative sample creation during training

This changes the parameter settings of the fitted model.

fit(ground_truth_df, copy_ground_truth=False)

Fits name indexers on ground truth data.

Fit excludes the supervised model, which needs a training list of names that match the ground truth. See instead: cls.fit_classifier().

Args:

ground_truth_df: pandas dataframe with ground truth names and corresponding ids.
copy_ground_truth: if True, keep a copy of the ground truth, useful for storage of the model.

Returns:

self reference (for compatibility with sklearn models)

Parameters:

ground_truth_df (DataFrame)
copy_ground_truth (bool)

Return type:

PandasEntityMatching

fit_classifier(train_positive_names_to_match=None, train_name_pairs=None, create_negative_sample_fraction=0, n_train_ids=-1, random_seed=42, train_gt=None, store_key='nm_score', train_function=<function train_model>, score_columns=None, drop_duplicate_candidates=None, extra_features=None, **fit_kws)

Function to train the supervised model based on positive input names.

Positive names are names that are supposed to match to the ground truth. A fraction of the positive names can be converted to negative names, which are not supposed to match to the ground truth.

Args:

train_positive_names_to_match: pandas dataframe of positive names to match for training. A positive name has a guaranteed match to a name in the ground truth. Two columns are needed: a name and an id (to determine the corresponding match in the ground truth).
train_name_pairs: pandas dataframe with training name-pair candidates, an alternative to train_positive_names_to_match. When not provided, training name pairs are created from the positive names to match using self.create_training_name_pairs(). Default is None (optional).
create_negative_sample_fraction: fraction of ids converted to negative names. A negative name has guaranteed no match to any name in the ground truth. Default is 0: no negative names are created.
n_train_ids: down-sample the positive names to match, keeping only n_train_ids ids. Default is -1 (keep all).
random_seed: random seed for down-sampling of ids. Default is 42.
train_gt: pandas dataframe of ground truth names and ids for training the indexers. By default we assume the indexers have already been fit. Default is None (optional).
store_key: storage key for the new supervised model. Default is 'nm_score'.
train_function: custom function to create and train the model pipeline. Optional.
score_columns: list of columns with raw scores from indexers to pass to the classifier. Default is None, meaning all indexer scores (e.g. cosine similarity values).
drop_duplicate_candidates: if True, drop any duplicate training candidates and keep just one; if available, keep the correct match. Recommended for string-similarity models, e.g. with without_rank_features=True. Default is False.
extra_features: list of columns (and possibly functions) used for extra features calculation, e.g. country if name_only=False. Default is None. With name_only=False, internally extra_features=['country'].
fit_kws: extra kwargs passed on to the model fit function. Optional.

Returns:

self reference (object including the trained supervised model)

Parameters:

train_positive_names_to_match (Optional[DataFrame])
create_negative_sample_fraction (float)
n_train_ids (int)
random_seed (int)
train_gt (Optional[DataFrame])
drop_duplicate_candidates (Optional[bool])
extra_features (Optional[list[str | tuple[str, Callable]]])

Return type:

PandasEntityMatching

increase_window_by_one_step()

Utility function for negative sample creation during training

This changes the parameter settings of the fitted model.

initialize()

If you updated parameters of EntityMatching, you might want to initialize again.

static load(emo_path, load_func=<function load_joblib>, override_parameters=None, name_col=None, entity_id_col=None, **kwargs)

Load the EMM object.

Below are the most common arguments. For the complete list, see emm.parameters.MODEL_PARAMS. These arguments are optional and update the parameters dictionary.

Args:

emo_path: path to the EMM pickle file.
load_func: function used for loading the object. Default is joblib.load().
override_parameters: parameters that overwrite the settings of the EMM object. Optional.
name_col: name column in dataframe. Default is "name".
entity_id_col: id column in dataframe. Default is "id".
kwargs: extra keyword arguments are passed on to the parameters dictionary.

Returns:

instantiated EMM object

Examples:

>>> from emm import PandasEntityMatching
>>> # deserialize pickled EMM object and set the name and id columns
>>> em = PandasEntityMatching.load(emo_path, name_col='Name', entity_id_col='Id')

Parameters:

emo_path (str)
load_func (Callable)
override_parameters (Optional[Mapping[str, Any]])
name_col (Optional[str])
entity_id_col (Optional[str])

Return type:

object

save(emo_path, dump_func=functools.partial(<function dump>, compress=True))

Serialize the EMM object.

Args:

emo_path: path to the EMM pickle file.
dump_func: function used for dumping self. Default is joblib.dump() with compression turned on.

Parameters:

emo_path (str)
dump_func (Callable)
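
The default dump_func is a functools.partial that pre-binds compress=True, so it is called with only the object and the path. The sketch below shows that pattern with standard-library stand-ins: `dump` and `load` here are hypothetical pickle/gzip helpers, not joblib, and the calling convention (object first, then path) is an assumption.

```python
import functools
import gzip
import os
import pickle
import tempfile


def dump(obj, path, compress=False):
    """Stand-in dump function: pickle `obj` to `path`, optionally gzip-compressed."""
    opener = gzip.open if compress else open
    with opener(path, "wb") as f:
        pickle.dump(obj, f)


def load(path, compress=False):
    """Matching stand-in loader for files written by `dump`."""
    opener = gzip.open if compress else open
    with opener(path, "rb") as f:
        return pickle.load(f)


# Pre-bind compress=True so callers only pass (obj, path).
dump_func = functools.partial(dump, compress=True)
load_func = functools.partial(load, compress=True)

state = {"name_col": "Name", "entity_id_col": "Id"}  # placeholder for a model object
with tempfile.TemporaryDirectory() as tmp:
    emo_path = os.path.join(tmp, "model.pkl")
    dump_func(state, emo_path)
    restored = load_func(emo_path)
```

A custom dump_func passed to save() should be paired with a load_func in load() that can read the same format.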

set_return_sm_features(return_features=True)

Toggle setting to return supervised model features

Args:

return_features: bool to return supervised model features. Default is True.

test_classifier(test_names_to_match, test_gt=None)

Helper function for testing the supervised model.

Prints multiple ML model metrics.

Args:

test_names_to_match: test dataframe with names (and ids) to match.
test_gt: alternative ground truth (GT) to test against. Optional, default is None.

Parameters:

test_names_to_match (DataFrame)
test_gt (Optional[DataFrame])

transform(names_df, top_n=-1)

Matches given names against ground truth.

transform() returns a pandas dataframe with name-pair candidates.

Args:

names_df: dataframe or series with names to be matched.
top_n: return the top-n candidates per name to match. top_n must be > 0, or -1 to return all candidates. Default is -1.

Returns:

dataframe with candidate name-pairs

Parameters:

names_df (DataFrame | Series)
top_n (int)

Return type:

DataFrame
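
The top_n semantics (a positive number keeps that many candidates per name, -1 keeps all) can be illustrated without emm. In this sketch the helper `keep_top_n`, the tuple layout and the example names are hypothetical, and ranking by a single score column is an assumption; it does not show how emm ranks candidates internally.

```python
from collections import defaultdict


def keep_top_n(candidates, top_n=-1):
    """Keep the `top_n` highest-scoring candidates per name; -1 keeps all.

    `candidates` is a list of (name, gt_name, score) tuples (hypothetical layout).
    """
    if top_n != -1 and top_n <= 0:
        raise ValueError("top_n must be > 0, or -1 to return all candidates")
    grouped = defaultdict(list)
    for name, gt_name, score in candidates:
        grouped[name].append((name, gt_name, score))
    result = []
    for rows in grouped.values():
        # Highest score first within each name to match.
        rows.sort(key=lambda row: row[2], reverse=True)
        result.extend(rows if top_n == -1 else rows[:top_n])
    return result


pairs = [
    ("Acme Ltd", "ACME Limited", 0.92),
    ("Acme Ltd", "Acme Holdings", 0.61),
    ("Acme Ltd", "Apex Ltd", 0.35),
    ("Globex", "Globex Corp", 0.88),
]
best = keep_top_n(pairs, top_n=1)
```

With top_n=1 each name keeps only its single best candidate, while the default of -1 leaves the candidate list untouched apart from ordering.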

emm.set_logger(level=20, format='%(asctime)s %(name)-12s %(levelname)-8s %(message)s')
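
The defaults in this signature map onto standard logging concepts: level 20 is logging.INFO, and the format string pads the logger name to 12 characters and the level name to 8. The sketch below reproduces those settings with the standard logging module only; it is an assumption about what the defaults mean, not the body of set_logger, and the logger name `emm_demo` is hypothetical.

```python
import io
import logging

FORMAT = "%(asctime)s %(name)-12s %(levelname)-8s %(message)s"

# Capture log output in memory so the formatted line can be inspected.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(FORMAT))

logger = logging.getLogger("emm_demo")  # hypothetical logger name
logger.setLevel(20)  # 20 is logging.INFO
logger.addHandler(handler)
logger.propagate = False

logger.debug("hidden: DEBUG (10) is below the INFO threshold")
logger.info("indexers fitted")

output = stream.getvalue()
```

Passing level=10 (logging.DEBUG) would make the first message appear as well.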