emm package

Subpackages

Submodules

emm.parameters module

Default parameters for Entity Matching.

emm.resources module

emm.resources.data(name)

Return the full path filename of a shipped data file.

Args:

name: The name of the data.

Returns:

The full path filename of the data.

Raises:

FileNotFoundError: If the data cannot be found.

Parameters:

name (str)

Return type:

str

emm.resources.notebook(name)

Return the full path filename of a tutorial notebook.

Args:

name: The name of the notebook.

Returns:

The full path filename of the notebook.

Raises:

FileNotFoundError: If the notebook cannot be found.

Parameters:

name (str)

Return type:

str
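The contract shared by data() and notebook() (return a full path as a string, raise FileNotFoundError when the file is missing) can be sketched as follows. This is a minimal illustration of the documented behaviour, not the package's implementation; `resolve_resource`, the temporary directory, and the file names are hypothetical stand-ins.

```python
import tempfile
from pathlib import Path


def resolve_resource(directory: str, name: str) -> str:
    """Return the full path of `name` inside `directory` (hypothetical helper)."""
    full_path = Path(directory) / name
    if not full_path.exists():
        raise FileNotFoundError(f"Resource '{name}' not found in {directory}")
    return str(full_path)


# A throwaway directory stands in for the directory of shipped files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "example.csv").write_text("id,name\n1,Acme\n")

    # An existing file resolves to its full path.
    found = resolve_resource(tmp, "example.csv")

    # A missing file raises FileNotFoundError, as documented above.
    try:
        resolve_resource(tmp, "missing.csv")
        missing_raised = False
    except FileNotFoundError:
        missing_raised = True
```

Callers that accept user-supplied names would typically wrap the lookup in a try/except FileNotFoundError block, as the sketch does.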

emm.version module

Module contents

class emm.PandasEntityMatching(parameters=None, supervised_models=None, name_col=None, entity_id_col=None, name_only=None, preprocessor=None, indexers=None, supervised_on=None, without_rank_features=None, with_legal_entity_forms_match=None, return_sm_features=None, supervised_model_object=None, aggregation_layer=None, aggregation_method=None, carry_on_cols=None, **kwargs)

Bases: BaseEntityMatching

Implementation of EntityMatching using Pandas.

Parameters:
  • parameters (Optional[dict[str, Any]])

  • supervised_models (Optional[Mapping[str, Any]])

  • name_col (Optional[str])

  • entity_id_col (Optional[str])

  • name_only (Optional[bool])

  • preprocessor (Optional[str])

  • indexers (Optional[list])

  • supervised_on (Optional[bool])

  • without_rank_features (Optional[bool])

  • with_legal_entity_forms_match (Optional[bool])

  • return_sm_features (Optional[bool])

  • supervised_model_object (Optional[Pipeline])

  • aggregation_layer (Optional[bool])

  • aggregation_method (Optional[Literal['mean_score', 'max_frequency_nm_score']])

  • carry_on_cols (Optional[list[str]])

add_aggregation_layer(account_col=None, freq_col=None, aggregation_method=None, blacklist=None, aggregation_layer=None)

Add or replace the aggregation layer in the pipeline.

Args:

account_col: column indicating which names-to-match belong together. Default is “account”.

freq_col: name frequency column. Default is “counterparty_account_count_distinct”.

aggregation_method: aggregation method, ‘max_frequency_nm_score’ or ‘mean_score’ (the values accepted by the class signature). Default is None.

blacklist: blacklist of names to skip in clustering.

aggregation_layer: existing aggregation layer to add. Default is None, in which case one is created.

Parameters:
  • account_col (Optional[str])

  • freq_col (Optional[str])

  • aggregation_method (Optional[str])

  • blacklist (Optional[list])

  • aggregation_layer (Optional[BaseEntityAggregation])

Return type:

None

add_supervised_model(path=None, model=None, name_only=True, store_key='nm_score', overwrite=True, return_features=None)

Add a trained sklearn supervised model to an existing pipeline.

Args:

path: file path of a pickled sklearn pipeline. Alternatively, provide the model directly.

model: trained sklearn pipeline to add to the supervised layer.

name_only: whether this is a name-only model. If False, the presence of extra features (country) is checked. Default is True.

store_key: storage key for the new sklearn supervised model. Default is ‘nm_score’.

overwrite: overwrite the existing model if store_key is already used. Default is True.

return_features: bool to return supervised model features. None means the default: False.

Parameters:
  • path (Optional[str])

  • model (Optional[Pipeline])

  • name_only (bool)

  • store_key (str)

  • overwrite (bool)

  • return_features (Optional[bool])

Return type:

None

create_training_name_pairs(train_positive_names_to_match, create_negative_sample_fraction=0, n_train_ids=-1, random_seed=42, drop_duplicate_candidates=None, **kwargs)

Create name-pairs for training from positive names that match to the ground truth.

Positive names are names that are supposed to match to the ground truth. A fraction of the positive names can be converted to negative names, which are not supposed to match to the ground truth.

Args:

train_positive_names_to_match: pandas dataframe of positive names to match for training. A positive name has a guaranteed match to a name in the ground truth. Two columns are needed: a name and id (to determine a corresponding match to the ground truth).

create_negative_sample_fraction: fraction of ids converted to negative names. A negative name has guaranteed no match to any name in the ground truth. Default is 0: no negative names are created.

n_train_ids: down-sample the positive names to match, keeping only n_train_ids ids. Default is -1 (keep all).

random_seed: random seed for down-sampling of ids. Default is 42.

drop_duplicate_candidates: if True, drop any duplicate training candidates and keep just one; if available, keep the correct match. Recommended for string-similarity models, e.g. with without_rank_features=True. Default is False.

kwargs: extra keyword arguments passed to prepare_name_pairs_pd.

Returns:

pandas dataframe with name-pair candidates to be used for training.

Parameters:
  • train_positive_names_to_match (DataFrame)

  • create_negative_sample_fraction (float)

  • n_train_ids (int)

  • random_seed (int)

  • drop_duplicate_candidates (Optional[bool])

Return type:

DataFrame
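How n_train_ids, random_seed, and create_negative_sample_fraction interact can be illustrated with a small standard-library sketch. It assumes that ids are first down-sampled and that a fraction of the remaining ids is then set aside as negatives; the real method works on pandas dataframes and may order or round these steps differently. `split_training_ids` is a hypothetical helper, not part of emm.

```python
import random


def split_training_ids(ids, create_negative_sample_fraction=0.0,
                       n_train_ids=-1, random_seed=42):
    """Down-sample ids, then pick a fraction of them to treat as negatives.

    Hypothetical helper that mirrors the documented parameters only.
    """
    rng = random.Random(random_seed)
    unique_ids = sorted(set(ids))

    # n_train_ids=-1 keeps every id; otherwise keep a random subset.
    if 0 <= n_train_ids < len(unique_ids):
        unique_ids = sorted(rng.sample(unique_ids, n_train_ids))

    # Convert the requested fraction of the kept ids to negatives.
    n_negative = int(len(unique_ids) * create_negative_sample_fraction)
    negative_ids = set(rng.sample(unique_ids, n_negative))
    positive_ids = [i for i in unique_ids if i not in negative_ids]
    return positive_ids, sorted(negative_ids)


# Keep 5 of 10 ids, then turn 20% of those (one id) into a negative.
positive_ids, negative_ids = split_training_ids(
    range(10), create_negative_sample_fraction=0.2, n_train_ids=5
)

# With the defaults nothing is dropped and no negatives are created.
all_positive, no_negative = split_training_ids(range(10))
```

Fixing random_seed makes the split reproducible, which is why the documented default is a constant (42) rather than None.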

decrease_window_by_one_step()

Utility function for negative sample creation during training

This changes the parameter settings of the fitted model.

fit(ground_truth_df, copy_ground_truth=False)

Fits name indexers on ground truth data.

Fit excludes the supervised model, which needs a training list of names that match to the ground truth. See instead: cls.fit_classifier().

Args:

ground_truth_df: pandas dataframe with ground truth names and corresponding ids.

copy_ground_truth: if True, keep a copy of the ground truth, useful for storage of the model. Default is False.

Returns:

self reference (for compatibility with sklearn models)

Parameters:
  • ground_truth_df (DataFrame)

  • copy_ground_truth (bool)

Return type:

PandasEntityMatching

fit_classifier(train_positive_names_to_match=None, train_name_pairs=None, create_negative_sample_fraction=0, n_train_ids=-1, random_seed=42, train_gt=None, store_key='nm_score', train_function=<function train_model>, score_columns=None, drop_duplicate_candidates=None, extra_features=None, **fit_kws)

Function to train the supervised model based on positive input names.

Positive names are names that are supposed to match to the ground truth. A fraction of the positive names can be converted to negative names, which are not supposed to match to the ground truth.

Args:

train_positive_names_to_match: pandas dataframe of positive names to match for training. A positive name has a guaranteed match to a name in the ground truth. Two columns are needed: a name and id (to determine a corresponding match to the ground truth).

train_name_pairs: pandas dataframe with training name-pair candidates, an alternative to train_positive_names_to_match. When not provided, training name pairs are created from the positive names to match using self.create_training_name_pairs(). Default is None (optional).

create_negative_sample_fraction: fraction of ids converted to negative names. A negative name has guaranteed no match to any name in the ground truth. Default is 0: no negative names are created.

n_train_ids: down-sample the positive names to match, keeping only n_train_ids ids. Default is -1 (keep all).

random_seed: random seed for down-sampling of ids. Default is 42.

train_gt: pandas dataframe of ground truth names and ids for training the indexers. By default the indexers are assumed to have been fit already. Default is None (optional).

store_key: storage key for the new supervised model. Default is ‘nm_score’.

train_function: custom function to create and train the model pipeline. Optional.

score_columns: list of columns with raw scores from indexers to pass to the classifier. Default is None, meaning all indexer scores (e.g. cosine similarity values).

drop_duplicate_candidates: if True, drop any duplicate training candidates and keep just one; if available, keep the correct match. Recommended for string-similarity models, e.g. with without_rank_features=True. Default is False.

extra_features: list of columns (and possibly functions) used for extra feature calculation, e.g. country if name_only=False. Default is None. With name_only=False, extra_features=['country'] is used internally.

fit_kws: extra kwargs passed on to the model fit function. Optional.

Returns:

self reference (object including the trained supervised model)

Parameters:
  • train_positive_names_to_match (Optional[DataFrame])

  • create_negative_sample_fraction (float)

  • n_train_ids (int)

  • random_seed (int)

  • train_gt (Optional[DataFrame])

  • drop_duplicate_candidates (Optional[bool])

  • extra_features (Optional[list[str | tuple[str, Callable]]])

Return type:

PandasEntityMatching

increase_window_by_one_step()

Utility function for negative sample creation during training

This changes the parameter settings of the fitted model.

initialize()

If you updated parameters of EntityMatching, you might want to initialize again.

static load(emo_path, load_func=<function load_joblib>, override_parameters=None, name_col=None, entity_id_col=None, **kwargs)

Load the EMM object.

Below are the most common arguments. For the complete list, see emm.parameters.MODEL_PARAMS. These arguments are optional and update the parameters dictionary.

Args:

emo_path: path to the EMM pickle file.

load_func: function used for loading the object. Default is joblib.load().

override_parameters: parameters that overwrite the settings of the EMM object. Optional.

name_col: name column in the dataframe. Default is “name”.

entity_id_col: id column in the dataframe. Default is “id”.

kwargs: extra keyword arguments are passed on to the parameters dictionary.

Returns:

instantiated EMM object

Examples:

>>> # deserialize pickled EMM object and rename name column
>>> em = PandasEntityMatching.load(emo_path, name_col='Name', entity_id_col='Id')

Parameters:
  • emo_path (str)

  • load_func (Callable)

  • override_parameters (Optional[Mapping[str, Any]])

  • name_col (Optional[str])

  • entity_id_col (Optional[str])

Return type:

object

save(emo_path, dump_func=functools.partial(<function dump>, compress=True))

Serialize the EMM object.

Args:

emo_path: path to the EMM pickle file.

dump_func: function used for dumping self. Default is joblib.dump() with compression turned on.

Parameters:
  • emo_path (str)

  • dump_func (Callable)
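Because dump_func and load_func are plain callables, any serializer with a compatible shape can be swapped in. The sketch below assumes dump_func is invoked as dump_func(obj, path) and load_func as load_func(path), which is the calling convention of joblib.dump and joblib.load; that assumption is not stated in the text above. It uses the standard-library pickle module as a stand-in, and `dump_pickle`, `load_pickle`, and the `model_state` dictionary are hypothetical.

```python
import functools
import pickle
import tempfile
from pathlib import Path


def dump_pickle(obj, path, protocol=pickle.HIGHEST_PROTOCOL):
    """Stand-in dump function: serialize obj to the file at path."""
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=protocol)


def load_pickle(path):
    """Stand-in load function: deserialize the object stored at path."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Pre-bind an option with functools.partial, the same pattern the default
# dump_func uses to switch compression on.
dump_func = functools.partial(dump_pickle, protocol=4)

# Placeholder for a fitted object; a real model would be saved the same way.
model_state = {"name_col": "Name", "entity_id_col": "Id"}

with tempfile.TemporaryDirectory() as tmp:
    emo_path = str(Path(tmp) / "model.pkl")
    dump_func(model_state, emo_path)   # what save() would delegate to
    restored = load_pickle(emo_path)   # what load() would delegate to
```

Whatever pair is chosen, the loader has to match the dumper: a file written by one serializer generally cannot be read by another.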

set_return_sm_features(return_features=True)

Toggle setting to return supervised model features

Args:

return_features: bool to return supervised model features, default is True.

test_classifier(test_names_to_match, test_gt=None)

Helper function for testing the supervised model.

Print multiple ML model metrics.

Args:

test_names_to_match: test dataframe with names (and ids) to match.

test_gt: alternative ground truth to test against. Optional, default is None.

Parameters:
  • test_names_to_match (DataFrame)

  • test_gt (Optional[DataFrame])

transform(names_df, top_n=-1)

Matches given names against ground truth.

transform() returns a pandas dataframe with name-pair candidates.

Args:

names_df: dataframe or series with names to be matched.

top_n: return the top-n candidates per name to match. Must be greater than 0, or -1 to return all candidates. Default is -1.

Returns:

dataframe with candidate name-pairs

Parameters:
  • names_df (DataFrame | Series)

  • top_n (int)

Return type:

DataFrame
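The meaning of top_n can be shown with a small standard-library sketch: candidates are grouped per name to match, ranked by score, and truncated. This assumes a higher score means a better match and that ranking is by a single score; `keep_top_n` is a hypothetical helper and the candidate tuples are made-up illustrative values, not output of transform().

```python
from collections import defaultdict


def keep_top_n(candidates, top_n=-1):
    """Keep the top_n best-scoring candidates per name (hypothetical helper).

    candidates: list of (name, ground_truth_name, score) tuples.
    """
    if top_n == -1:
        return list(candidates)  # -1 keeps every candidate
    if top_n <= 0:
        raise ValueError("top_n must be > 0, or -1 to keep all candidates")

    # Group the candidate pairs by the name being matched.
    grouped = defaultdict(list)
    for row in candidates:
        grouped[row[0]].append(row)

    # Rank each group by descending score and truncate it.
    kept = []
    for rows in grouped.values():
        rows.sort(key=lambda r: r[2], reverse=True)
        kept.extend(rows[:top_n])
    return kept


# Made-up candidate pairs for illustration only.
candidates = [
    ("Acme Inc", "ACME Incorporated", 0.92),
    ("Acme Inc", "Acme Holdings", 0.71),
    ("Acme Inc", "Ace Ltd", 0.40),
    ("Globex", "Globex Corp", 0.88),
]

top_one = keep_top_n(candidates, top_n=1)
everything = keep_top_n(candidates)
```

A small top_n keeps the result compact when only the best match matters, while -1 is useful for inspecting the full candidate set.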

emm.set_logger(level=20, format='%(asctime)s %(name)-12s %(levelname)-8s %(message)s')
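The default arguments map directly onto the standard-library logging module: level 20 is logging.INFO, and the format string pads the logger name to 12 characters and the level name to 8. The sketch below shows what those values produce with plain logging; it does not reproduce what set_logger does internally, which this page does not describe. The logger name "emm_demo" is hypothetical.

```python
import io
import logging

LOG_FORMAT = "%(asctime)s %(name)-12s %(levelname)-8s %(message)s"

# Capture log output in memory so it can be inspected.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(LOG_FORMAT))

logger = logging.getLogger("emm_demo")  # hypothetical logger name
logger.setLevel(20)                     # 20 is logging.INFO
logger.addHandler(handler)
logger.propagate = False                # keep the demo off the root logger

logger.debug("hidden: below the INFO threshold")
logger.info("fitting indexers")

output = stream.getvalue()
```

Passing level=10 (logging.DEBUG) instead would also let the debug message through.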