emm.pipeline package¶

Submodules¶

emm.pipeline.base_entity_matching module¶

class emm.pipeline.base_entity_matching.BaseEntityMatching(parameters=None, supervised_models=None)¶

Bases: Pipeline, ABC

Base implementation of EntityMatching

Parameters:

parameters (Optional[dict])
supervised_models (Optional[dict[str, Any]])

calc_threshold(agg_name, type_name, metric_name, min_value, threshold_parameters=None)¶

Calculate threshold score for given metric with minimum metric value

Args:

agg_name: name of aggregation method, see get_threshold_agg_name().
type_name: “positive” or “negative” names, or “all” (positive and negative).
metric_name: name of metric, e.g. “precision”, “TNR”, “TPR”, “fullrecall”, “predicted_matches_rate”.
min_value: minimum value for the metric.
threshold_parameters: dict with threshold curves; use threshold.get_threshold_curves_parameters(). If not provided, try to get this from self.parameters.

Returns:

threshold score
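
To make the idea of a "threshold score for a given metric with a minimum metric value" concrete, here is a minimal sketch. It is not the library's implementation. It assumes a threshold curve can be represented as a list of (threshold, metric_value) points, which is a hypothetical layout; the structure returned by threshold.get_threshold_curves_parameters() may well differ. The helper name lowest_threshold_reaching is also invented for this example.

```python
from typing import Optional, Sequence, Tuple


def lowest_threshold_reaching(
    curve: Sequence[Tuple[float, float]], min_value: float
) -> Optional[float]:
    """Return the lowest threshold whose metric value is at least min_value.

    `curve` is a hypothetical list of (threshold, metric_value) points.
    Returns None when no point on the curve reaches min_value.
    """
    qualifying = [thr for thr, value in curve if value >= min_value]
    return min(qualifying) if qualifying else None


# Hypothetical precision curve: precision rises as the score threshold rises.
precision_curve = [(0.1, 0.60), (0.3, 0.75), (0.5, 0.90), (0.7, 0.97)]
threshold = lowest_threshold_reaching(precision_curve, min_value=0.90)
print(threshold)  # prints 0.5
```

Picking the lowest qualifying threshold keeps as many candidate matches as possible while still meeting the requested minimum; whether the library makes the same choice is an assumption here.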

get_model_title()¶

Construct model title from parameters settings

Builds an experiment title for the model from its settings: indexer, supervised model (sm) and aggregation. The title can be used, for example, when storing the model.

static get_threshold_agg_name(aggregation_layer=False, aggregation_method='name_clustering')¶

Helper function for getting/setting aggregation method name

Args:

aggregation_layer: use aggregation layer? Default is False.
aggregation_method: which aggregation method is used? ‘name_clustering’ or ‘mean_score’.

Returns:

‘non_aggregated’ if aggregation_layer is False, else aggregation_method.
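
The documented return rule fits in a few lines. The function below is a stand-alone restatement of that rule for illustration, with a made-up name (threshold_agg_name_sketch); it is not the library's source.

```python
def threshold_agg_name_sketch(
    aggregation_layer: bool = False,
    aggregation_method: str = "name_clustering",
) -> str:
    """Mirror the documented rule: 'non_aggregated' when no aggregation layer is used."""
    return aggregation_method if aggregation_layer else "non_aggregated"


print(threshold_agg_name_sketch())  # prints non_aggregated
print(threshold_agg_name_sketch(aggregation_layer=True, aggregation_method="mean_score"))  # prints mean_score
```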

set_threshold(type_name, metric_name, min_value, agg_name=None, threshold_parameters=None)¶

Calculate and set threshold score for given metric with minimum metric value

Args:

type_name: “positive” names or “all” (positive and negative).
metric_name: name of metric, e.g. “precision”, “TNR”, “TPR”, “fullrecall”, “predicted_matches_rate”.
min_value: minimum value for the metric.
agg_name: name of aggregation method. If None, taken from self.get_threshold_agg_name().
threshold_parameters: dict with threshold curves; use threshold.get_threshold_curves_parameters(). If not provided, try to get this from self.parameters.

static version()¶

emm.pipeline.pandas_entity_matching module¶

class emm.pipeline.pandas_entity_matching.PandasEntityMatching(parameters=None, supervised_models=None, name_col=None, entity_id_col=None, name_only=None, preprocessor=None, indexers=None, supervised_on=None, without_rank_features=None, with_legal_entity_forms_match=None, return_sm_features=None, supervised_model_object=None, aggregation_layer=None, aggregation_method=None, carry_on_cols=None, **kwargs)¶

Bases: BaseEntityMatching

Implementation of EntityMatching using Pandas.

Parameters:

parameters (Optional[dict[str, Any]])
supervised_models (Optional[Mapping[str, Any]])
name_col (Optional[str])
entity_id_col (Optional[str])
name_only (Optional[bool])
preprocessor (Optional[str])
indexers (Optional[list])
supervised_on (Optional[bool])
without_rank_features (Optional[bool])
with_legal_entity_forms_match (Optional[bool])
return_sm_features (Optional[bool])
supervised_model_object (Optional[Pipeline])
aggregation_layer (Optional[bool])
aggregation_method (Optional[Literal['mean_score', 'max_frequency_nm_score']])
carry_on_cols (Optional[list[str]])

add_aggregation_layer(account_col=None, freq_col=None, aggregation_method=None, blacklist=None, aggregation_layer=None)¶

Add an aggregation layer to the pipeline, or replace the existing one.

Args:

account_col: column indicating which names-to-match belong together. Default is “account”.
freq_col: name frequency column. Default is “counterparty_account_count_distinct”.
aggregation_method: aggregation method: ‘name_clustering’ or ‘mean_score’. Default is ‘name_clustering’.
blacklist: blacklist of names to skip in clustering.
aggregation_layer: existing aggregation layer to add. Default is None, in which case one is created.

Parameters:

account_col (Optional[str])
freq_col (Optional[str])
aggregation_method (Optional[str])
blacklist (Optional[list])
aggregation_layer (Optional[BaseEntityAggregation])

Return type:

None

add_supervised_model(path=None, model=None, name_only=True, store_key='nm_score', overwrite=True, return_features=None)¶

Add a trained sklearn supervised model to the existing pipeline.

Args:

path: file path of pickled sklearn pipeline. Alternatively, provide the model directly.
model: trained sklearn pipeline to add to the supervised layer.
name_only: name-only model? If False, the presence of extra features (country) is checked. Default is True.
store_key: storage key for the new sklearn supervised model. Default is ‘nm_score’.
overwrite: overwrite the existing model if store_key is already used. Default is True.
return_features: bool to return supervised model features. None means the default: False.

Parameters:

path (Optional[str])
model (Optional[Pipeline])
name_only (bool)
store_key (str)
overwrite (bool)
return_features (Optional[bool])

Return type:

None

create_training_name_pairs(train_positive_names_to_match, create_negative_sample_fraction=0, n_train_ids=-1, random_seed=42, drop_duplicate_candidates=None, **kwargs)¶

Create name-pairs for training from positive names that match to the ground truth.

Positive names are names that are supposed to match to the ground truth. A fraction of the positive names can be converted to negative names, which are not supposed to match to the ground truth.

Args:

train_positive_names_to_match: pandas dataframe of positive names to match for training. A positive name has a guaranteed match to a name in the ground truth. Two columns are needed: a name and an id (to determine the corresponding match in the ground truth).
create_negative_sample_fraction: fraction of ids converted to negative names. A negative name has guaranteed no match to any name in the ground truth. Default is 0: no negative names are created.
n_train_ids: down-sample the positive names to match, keeping only n_train_ids ids. Default is -1 (keep all).
random_seed: random seed for down-sampling of ids. Default is 42.
drop_duplicate_candidates: if True, drop any duplicate training candidates and keep just one; if available, keep the correct match. Recommended for string-similarity models, e.g. with without_rank_features=True. Default is False.
kwargs: extra keyword arguments passed on to prepare_name_pairs_pd.

Returns:

pandas dataframe with name-pair candidates to be used for training.

Parameters:

train_positive_names_to_match (DataFrame)
create_negative_sample_fraction (float)
n_train_ids (int)
random_seed (int)
drop_duplicate_candidates (Optional[bool])

Return type:

DataFrame
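
As a rough illustration of create_negative_sample_fraction and random_seed, the sketch below selects a reproducible fraction of ids to be treated as negative names. It is only a sketch under stated assumptions: the helper split_negative_ids, the use of random.Random, and the rounding rule are all hypothetical, and the example says nothing about how the library itself samples ids or what it does with them afterwards.

```python
import random
from typing import Dict, List, Set


def split_negative_ids(ids: List[int], fraction: float, random_seed: int = 42) -> Set[int]:
    """Pick round(fraction * number_of_unique_ids) ids to treat as negative names."""
    unique_ids = sorted(set(ids))
    n_negative = round(fraction * len(unique_ids))
    rng = random.Random(random_seed)  # fixed seed makes the split reproducible
    return set(rng.sample(unique_ids, n_negative))


# Hypothetical positive names: each name carries the id of its ground-truth match.
positive_names: Dict[str, int] = {
    "Acme Inc": 1,
    "ACME Incorporated": 1,
    "Globex BV": 2,
    "Initech Ltd": 3,
    "Umbrella Corp": 4,
}
negative_ids = split_negative_ids(list(positive_names.values()), fraction=0.5)
labelled = {
    name: ("negative" if id_ in negative_ids else "positive")
    for name, id_ in positive_names.items()
}
print(labelled)
```

Sampling on ids rather than on individual names keeps all spellings of one entity on the same side of the split, which matches the wording "fraction of ids converted to negative names".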

decrease_window_by_one_step()¶

Utility function for negative sample creation during training

This changes the parameter settings of the fitted model.

fit(ground_truth_df, copy_ground_truth=False)¶

Fits name indexers on ground truth data.

Fit excludes the supervised model, which needs a training list of names that match the ground truth; for that, see fit_classifier().

Args:

ground_truth_df: pandas dataframe with ground truth names and corresponding ids.
copy_ground_truth: if True, keep a copy of the ground truth, useful for storage of the model.

Returns:

self reference (for compatibility with sklearn models)

Parameters:

ground_truth_df (DataFrame)
copy_ground_truth (bool)

Return type:

PandasEntityMatching

fit_classifier(train_positive_names_to_match=None, train_name_pairs=None, create_negative_sample_fraction=0, n_train_ids=-1, random_seed=42, train_gt=None, store_key='nm_score', train_function=<function train_model>, score_columns=None, drop_duplicate_candidates=None, extra_features=None, **fit_kws)¶

Function to train the supervised model based on positive input names.

Positive names are names that are supposed to match to the ground truth. A fraction of the positive names can be converted to negative names, which are not supposed to match to the ground truth.

Args:

train_positive_names_to_match: pandas dataframe of positive names to match for training. A positive name has a guaranteed match to a name in the ground truth. Two columns are needed: a name and an id (to determine the corresponding match in the ground truth).
train_name_pairs: pandas dataframe with training name-pair candidates, an alternative to train_positive_names_to_match. When not provided, training name pairs are created from the positive names to match using self.create_training_name_pairs(). Default is None (optional).
create_negative_sample_fraction: fraction of ids converted to negative names. A negative name has guaranteed no match to any name in the ground truth. Default is 0: no negative names are created.
n_train_ids: down-sample the positive names to match, keeping only n_train_ids ids. Default is -1 (keep all).
random_seed: random seed for down-sampling of ids. Default is 42.
train_gt: pandas dataframe of ground truth names and ids for training the indexers. By default the indexers are assumed to have been fit already. Default is None (optional).
store_key: storage key for the new supervised model. Default is ‘nm_score’.
train_function: custom function to create and train the model pipeline. Optional.
score_columns: list of columns with raw scores from indexers to pass to the classifier. Default is None, meaning all indexer scores (e.g. cosine similarity values).
drop_duplicate_candidates: if True, drop any duplicate training candidates and keep just one; if available, keep the correct match. Recommended for string-similarity models, e.g. with without_rank_features=True. Default is False.
extra_features: list of columns (and possibly functions) used for extra feature calculation, e.g. country if name_only=False. Default is None. With name_only=False, extra_features=['country'] is used internally.
fit_kws: extra kwargs passed on to the model fit function. Optional.

Returns:

self reference (object including the trained supervised model)

Parameters:

train_positive_names_to_match (Optional[DataFrame])
create_negative_sample_fraction (float)
n_train_ids (int)
random_seed (int)
train_gt (Optional[DataFrame])
drop_duplicate_candidates (Optional[bool])
extra_features (Optional[list[str | tuple[str, Callable]]])

Return type:

PandasEntityMatching
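
The drop_duplicate_candidates option ("keep just one; if available, keep the correct match") can be pictured with the sketch below. It rests on assumptions that the documentation does not confirm: that duplicates are candidates sharing the same (name, ground-truth name) pair, and that each candidate carries a flag saying whether it is the correct match. The Candidate type and the helper drop_duplicates_keep_correct are hypothetical.

```python
from typing import Dict, List, NamedTuple, Tuple


class Candidate(NamedTuple):
    name: str      # name to match
    gt_name: str   # candidate name from the ground truth
    gt_id: int     # id of the candidate in the ground truth
    correct: bool  # True when gt_id is the right match for `name`


def drop_duplicates_keep_correct(candidates: List[Candidate]) -> List[Candidate]:
    """Keep one candidate per (name, gt_name) pair, preferring a correct match."""
    kept: Dict[Tuple[str, str], Candidate] = {}
    for cand in candidates:
        key = (cand.name, cand.gt_name)
        # Keep the first candidate seen, unless a later one is the correct match.
        if key not in kept or (cand.correct and not kept[key].correct):
            kept[key] = cand
    return list(kept.values())


candidates = [
    Candidate("Acme Inc", "Acme Inc.", gt_id=7, correct=False),
    Candidate("Acme Inc", "Acme Inc.", gt_id=1, correct=True),
    Candidate("Acme Inc", "Acme Industries", gt_id=9, correct=False),
]
deduplicated = drop_duplicates_keep_correct(candidates)
print(deduplicated)
```

For a model that only looks at string similarity, the two "Acme Inc." rows have identical features but opposite labels, which is presumably why dropping such duplicates is recommended with without_rank_features=True.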

increase_window_by_one_step()¶

Utility function for negative sample creation during training

This changes the parameter settings of the fitted model.

initialize()¶

If you have updated the parameters of EntityMatching, you may want to initialize again.

static load(emo_path, load_func=<function load_joblib>, override_parameters=None, name_col=None, entity_id_col=None, **kwargs)¶

Load the EMM object.

Below are the most common arguments. For the complete list, see emm.parameters.MODEL_PARAMS. These arguments are optional and update the parameters dictionary.

Args:

emo_path: path to the EMM pickle file.
load_func: function used for loading the object. Default is joblib.load().
override_parameters: parameters that overwrite the settings of the EMM object. Optional.
name_col: name column in dataframe. Default is “name”.
entity_id_col: id column in dataframe. Default is “id”.
kwargs: extra keyword arguments are passed on to the parameters dictionary.

Returns:

instantiated EMM object

Examples:

>>> # deserialize pickled EMM object and override the name and id columns
>>> em = PandasEntityMatching.load(emo_path, name_col='Name', entity_id_col='Id')

Parameters:

emo_path (str)
load_func (Callable)
override_parameters (Optional[Mapping[str, Any]])
name_col (Optional[str])
entity_id_col (Optional[str])

Return type:

object

save(emo_path, dump_func=functools.partial(<function dump>, compress=True))¶

Serialize the EMM object.

Args:

emo_path: path to the EMM pickle file.
dump_func: function used for dumping self. Default is joblib.dump() with compression turned on.

Parameters:

emo_path (str)
dump_func (Callable)
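
The default dump_func in the signature is a functools.partial that binds compress=True. The sketch below shows that pattern using standard-library stand-ins (pickle and gzip) instead of joblib, so it runs without extra dependencies. The dump and load functions here are hypothetical stand-ins written for this example, and the assumption that a dump function is called as dump_func(obj, path) follows joblib's (value, filename) convention rather than anything stated in this page.

```python
import functools
import gzip
import os
import pickle
import tempfile
from typing import Any


def dump(obj: Any, path: str, compress: bool = False) -> None:
    """Stand-in dump function: pickle `obj` to `path`, optionally gzip-compressed."""
    opener = gzip.open if compress else open
    with opener(path, "wb") as handle:
        pickle.dump(obj, handle)


def load(path: str, compress: bool = False) -> Any:
    """Stand-in load function matching `dump` above."""
    opener = gzip.open if compress else open
    with opener(path, "rb") as handle:
        return pickle.load(handle)


# Bind compress=True up front so callers only pass (obj, path) or (path).
dump_func = functools.partial(dump, compress=True)
load_func = functools.partial(load, compress=True)

with tempfile.TemporaryDirectory() as tmp_dir:
    emo_path = os.path.join(tmp_dir, "model.pkl.gz")
    dump_func({"name_col": "Name", "entity_id_col": "Id"}, emo_path)
    restored = load_func(emo_path)

print(restored)
```

Passing the serializer in as a callable keeps save() and load() independent of any one storage format; a custom pair only needs to agree with each other.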

set_return_sm_features(return_features=True)¶

Toggle setting to return supervised model features

Args:

return_features: bool to return supervised model features. Default is True.

test_classifier(test_names_to_match, test_gt=None)¶

Helper function for testing the supervised model.

Print multiple ML model metrics.

Args:

test_names_to_match: test dataframe with names (and ids) to match.
test_gt: alternative ground truth (GT) to test against. Optional, default is None.

Parameters:

test_names_to_match (DataFrame)
test_gt (Optional[DataFrame])

transform(names_df, top_n=-1)¶

Matches given names against ground truth.

transform() returns a pandas dataframe with name-pair candidates.

Args:

names_df: dataframe or series with names to be matched.
top_n: return the top-n candidates per name to match; must be > 0, or -1 to return all candidates. Default is -1.

Returns:

dataframe with candidate name-pairs

Parameters:

names_df (DataFrame | Series)
top_n (int)

Return type:

DataFrame
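
The top_n semantics (a positive number keeps that many candidates per name, -1 keeps everything) can be sketched in plain Python as follows. This is an illustration only: ranking candidates by a single score value is an assumption, and the tuple layout and the helper top_n_candidates are hypothetical rather than part of the library.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Candidate = Tuple[str, str, float]  # (name to match, ground-truth name, score)


def top_n_candidates(candidates: List[Candidate], top_n: int = -1) -> List[Candidate]:
    """Keep the top_n highest-scoring candidates per name; -1 keeps all of them."""
    if top_n == -1:
        return list(candidates)
    if top_n <= 0:
        raise ValueError("top_n must be > 0, or -1 to return all candidates")
    # Group candidates by the name being matched, preserving first-seen order.
    per_name: Dict[str, List[Candidate]] = defaultdict(list)
    for cand in candidates:
        per_name[cand[0]].append(cand)
    selected: List[Candidate] = []
    for group in per_name.values():
        ranked = sorted(group, key=lambda c: c[2], reverse=True)
        selected.extend(ranked[:top_n])
    return selected


candidates = [
    ("Acme Inc", "Acme Inc.", 0.95),
    ("Acme Inc", "Acme Industries", 0.60),
    ("Acme Inc", "Apex Inc", 0.40),
    ("Globex", "Globex BV", 0.88),
]
best = top_n_candidates(candidates, top_n=1)
print(best)
```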

emm.pipeline.spark_entity_matching module¶

Module contents¶

class emm.pipeline.PandasEntityMatching(parameters=None, supervised_models=None, name_col=None, entity_id_col=None, name_only=None, preprocessor=None, indexers=None, supervised_on=None, without_rank_features=None, with_legal_entity_forms_match=None, return_sm_features=None, supervised_model_object=None, aggregation_layer=None, aggregation_method=None, carry_on_cols=None, **kwargs)¶

Bases: BaseEntityMatching

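Implementation of EntityMatching using Pandas. This is the package-level export of emm.pipeline.pandas_entity_matching.PandasEntityMatching; its parameters and methods are documented under that module above.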
