emm.pipeline package
Submodules
emm.pipeline.base_entity_matching module
- class emm.pipeline.base_entity_matching.BaseEntityMatching(parameters=None, supervised_models=None)
Bases: Pipeline, ABC
Base implementation of EntityMatching
- Parameters:
parameters (Optional[dict])
supervised_models (Optional[dict[str, Any]])
- calc_threshold(agg_name, type_name, metric_name, min_value, threshold_parameters=None)
Calculate the threshold score for a given metric with a minimum metric value
- Args:
agg_name: name of aggregation method, see get_threshold_agg_name().
type_name: “positive” or “negative” names, or “all” (positive and negative).
metric_name: name of metric, e.g. “precision”, “TNR”, “TPR”, “fullrecall”, “predicted_matches_rate”.
min_value: minimum value for the metric.
threshold_parameters: dict with threshold curves; use threshold.get_threshold_curves_parameters(). If not provided, try to get this from self.parameters.
- Returns:
threshold score
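As a rough illustration of what a "threshold score for a given metric with a minimum metric value" can mean, the sketch below scans a threshold curve and returns the lowest score at which the metric reaches min_value. The curve layout (two parallel sequences) and the helper name lowest_threshold_meeting are assumptions made for this example. They are not the format returned by threshold.get_threshold_curves_parameters(), and the selection rule used by calc_threshold() may differ.

```python
from typing import Optional, Sequence


def lowest_threshold_meeting(
    thresholds: Sequence[float],
    metric_values: Sequence[float],
    min_value: float,
) -> Optional[float]:
    """Return the smallest threshold whose metric value is >= min_value.

    Hypothetical helper: the curve is passed as two parallel sequences,
    which is an assumed layout, not emm's internal representation.
    """
    qualifying = [t for t, m in zip(thresholds, metric_values) if m >= min_value]
    # None signals that no point on the curve reaches the requested value.
    return min(qualifying) if qualifying else None


# Toy precision curve: precision rises as the score threshold increases.
thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]
precision = [0.60, 0.75, 0.88, 0.95, 0.99]

chosen = lowest_threshold_meeting(thresholds, precision, min_value=0.9)
```

Choosing the lowest qualifying score keeps as many candidate pairs as possible while still meeting the requested metric value, which is one plausible reading of the method description.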
- get_model_title()
Construct model title from parameters settings
Extract experimental title of model based on model’s settings: indexer, sm, aggregation. E.g. can be used for storage.
- static get_threshold_agg_name(aggregation_layer=False, aggregation_method='name_clustering')
Helper function for getting/setting aggregation method name
- Args:
aggregation_layer: use aggregation layer? default is False.
aggregation_method: which aggregation method is used? ‘name_clustering’ or ‘mean_score’.
- Returns:
‘non_aggregated’ if aggregation_layer is False else aggregation_method.
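The return rule above can be written out as a small stand-alone function. This is an illustrative stand-in that mirrors the documented behaviour, not the library's source; the function name threshold_agg_name is chosen for this example only.

```python
def threshold_agg_name(
    aggregation_layer: bool = False,
    aggregation_method: str = "name_clustering",
) -> str:
    """Mirror the documented rule: the method name is only used when aggregating."""
    return aggregation_method if aggregation_layer else "non_aggregated"


without_layer = threshold_agg_name()
with_layer = threshold_agg_name(aggregation_layer=True, aggregation_method="mean_score")
```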
- set_threshold(type_name, metric_name, min_value, agg_name=None, threshold_parameters=None)
Calculate and set the threshold score for a given metric with a minimum metric value
- Args:
type_name: “positive” names or “all” (positive and negative).
metric_name: name of metric, e.g. “precision”, “TNR”, “TPR”, “fullrecall”, “predicted_matches_rate”.
min_value: minimum value for the metric.
agg_name: name of aggregation method, if None take from self.get_threshold_agg_name().
threshold_parameters: dict with threshold curves; use threshold.get_threshold_curves_parameters(). If not provided, try to get this from self.parameters.
- static version()
emm.pipeline.pandas_entity_matching module
- class emm.pipeline.pandas_entity_matching.PandasEntityMatching(parameters=None, supervised_models=None, name_col=None, entity_id_col=None, name_only=None, preprocessor=None, indexers=None, supervised_on=None, without_rank_features=None, with_legal_entity_forms_match=None, return_sm_features=None, supervised_model_object=None, aggregation_layer=None, aggregation_method=None, carry_on_cols=None, **kwargs)
Bases: BaseEntityMatching
Implementation of EntityMatching using Pandas.
- Parameters:
parameters (Optional[dict[str, Any]])
supervised_models (Optional[Mapping[str, Any]])
name_col (Optional[str])
entity_id_col (Optional[str])
name_only (Optional[bool])
preprocessor (Optional[str])
indexers (Optional[list])
supervised_on (Optional[bool])
without_rank_features (Optional[bool])
with_legal_entity_forms_match (Optional[bool])
return_sm_features (Optional[bool])
supervised_model_object (Optional[Pipeline])
aggregation_layer (Optional[bool])
aggregation_method (Optional[Literal['mean_score', 'max_frequency_nm_score']])
carry_on_cols (Optional[list[str]])
- add_aggregation_layer(account_col=None, freq_col=None, aggregation_method=None, blacklist=None, aggregation_layer=None)
Add or replace the aggregation layer in the pipeline
- Args:
account_col: column that indicates which names-to-match belong together. default is “account”.
freq_col: name frequency column, default is “counterparty_account_count_distinct”.
aggregation_method: aggregation method: ‘name_clustering’ or ‘mean_score’. Default is ‘name_clustering’.
blacklist: blacklist of names to skip in clustering.
aggregation_layer: existing aggregation layer to add. Default is None, in which case one is created.
- Parameters:
account_col (Optional[str])
freq_col (Optional[str])
aggregation_method (Optional[str])
blacklist (Optional[list])
aggregation_layer (Optional[BaseEntityAggregation])
- Return type:
None
- add_supervised_model(path=None, model=None, name_only=True, store_key='nm_score', overwrite=True, return_features=None)
Add trained sklearn supervised model to existing pipeline
- Args:
path: file path of pickled sklearn pipeline. Alternatively, provide the model directly.
model: trained sklearn pipeline to add to the supervised layer.
name_only: name-only model? If False, presence of extra features (country) is checked. Default is True.
store_key: storage key for new sklearn supervised model. default is ‘nm_score’.
overwrite: overwrite existing model if store_key already used, default is True.
return_features: bool to return supervised model features. None means default: False.
- Parameters:
path (Optional[str])
model (Optional[Pipeline])
name_only (bool)
store_key (str)
overwrite (bool)
return_features (Optional[bool])
- Return type:
None
- create_training_name_pairs(train_positive_names_to_match, create_negative_sample_fraction=0, n_train_ids=-1, random_seed=42, drop_duplicate_candidates=None, **kwargs)
Create name-pairs for training from positive names that match to the ground truth.
Positive names are names that are supposed to match to the ground truth. A fraction of the positive names can be converted to negative names, which are not supposed to match to the ground truth.
- Args:
train_positive_names_to_match: pandas dataframe of positive names to match for training. A positive name has a guaranteed match to a name in the ground truth. Two columns are needed: a name and id (to determine a corresponding match to the ground truth).
create_negative_sample_fraction: fraction of ids converted to negative names. A negative name has guaranteed no match to any name in the ground truth. default is 0: no negative names are created.
n_train_ids: down-sample the positive names to match, keep only n_train_ids number of ids. default value is -1 (keep all).
random_seed: random seed for down-sampling of ids. default is 42.
drop_duplicate_candidates: if True drop any duplicate training candidates and keep just one, if available keep the correct match. Recommended for string-similarity models, e.g. with without_rank_features=True. default is False.
kwargs: extra key-word arguments meant to be passed to prepare_name_pairs_pd.
- Returns:
pandas dataframe with name-pair candidates to be used for training.
- Parameters:
train_positive_names_to_match (DataFrame)
create_negative_sample_fraction (float)
n_train_ids (int)
random_seed (int)
drop_duplicate_candidates (Optional[bool])
- Return type:
DataFrame
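The interplay of n_train_ids, create_negative_sample_fraction and random_seed can be pictured with the following sketch. It only shows seeded down-sampling of ids followed by marking a fraction of the remaining ids as negative. The helper name split_positive_negative_ids and the rounding rule are assumptions for illustration; how emm actually turns the selected ids into non-matching names is not shown here.

```python
import random
from typing import List, Set, Tuple


def split_positive_negative_ids(
    ids: List[int],
    create_negative_sample_fraction: float = 0.0,
    n_train_ids: int = -1,
    random_seed: int = 42,
) -> Tuple[Set[int], Set[int]]:
    """Down-sample ids, then mark a fraction of them as negative.

    Hypothetical helper for illustration; not part of emm.
    """
    rng = random.Random(random_seed)  # seeded, so the split is reproducible
    unique_ids = sorted(set(ids))
    # n_train_ids = -1 keeps all ids; otherwise keep a random subset.
    if 0 <= n_train_ids < len(unique_ids):
        unique_ids = rng.sample(unique_ids, n_train_ids)
    # Assumed rounding rule for the number of negative ids.
    n_negative = int(round(create_negative_sample_fraction * len(unique_ids)))
    negative_ids = set(rng.sample(unique_ids, n_negative))
    positive_ids = set(unique_ids) - negative_ids
    return positive_ids, negative_ids


positive_ids, negative_ids = split_positive_negative_ids(
    ids=list(range(10)), create_negative_sample_fraction=0.2
)
```

With ten ids and a fraction of 0.2, two ids end up in the negative set and eight stay positive; repeating the call with the same random_seed gives the same split.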
- decrease_window_by_one_step()
Utility function for negative sample creation during training
This changes the parameter settings of the fitted model.
- fit(ground_truth_df, copy_ground_truth=False)
Fits name indexers on ground truth data.
Fit excludes the supervised model, which needs a training list of names that match to the ground truth. See instead: cls.fit_classifier().
- Args:
ground_truth_df: pandas dataframe with ground truth names and corresponding ids.
copy_ground_truth: if True, keep a copy of the ground truth, useful for storage of the model.
- Returns:
self reference (for compatibility with sklearn models)
- Parameters:
ground_truth_df (DataFrame)
copy_ground_truth (bool)
- Return type:
PandasEntityMatching
- fit_classifier(train_positive_names_to_match=None, train_name_pairs=None, create_negative_sample_fraction=0, n_train_ids=-1, random_seed=42, train_gt=None, store_key='nm_score', train_function=<function train_model>, score_columns=None, drop_duplicate_candidates=None, extra_features=None, **fit_kws)
Function to train the supervised model based on positive input names.
Positive names are names that are supposed to match to the ground truth. A fraction of the positive names can be converted to negative names, which are not supposed to match to the ground truth.
- Args:
train_positive_names_to_match: pandas dataframe of positive names to match for training. A positive name has a guaranteed match to a name in the ground truth. Two columns are needed: a name and id (to determine a corresponding match to the ground truth).
train_name_pairs: pandas dataframe with training name pair candidates, an alternative to train_positive_names_to_match. When not provided, train name pairs are created from positive names to match using self.create_training_name_pairs(). default is None (optional).
create_negative_sample_fraction: fraction of ids converted to negative names. A negative name has guaranteed no match to any name in the ground truth. default is 0: no negative names are created.
n_train_ids: down-sample the positive names to match, keep only n_train_ids number of ids. default value is -1 (keep all).
random_seed: random seed for down-sampling of ids. default is 42.
train_gt: pandas dataframe of ground truth names and ids for training the indexers. By default we assume the indexers have already been fit. default is None (optional).
store_key: storage key for new supervised model. default is ‘nm_score’.
train_function: provide custom function to create and train model pipeline. optional.
score_columns: list of columns with raw scores from indexers to pass to classifier. default is None, meaning all indexer scores (e.g. cosine similarity values).
drop_duplicate_candidates: if True drop any duplicate training candidates and keep just one, if available keep the correct match. Recommended for string-similarity models, e.g. with without_rank_features=True. default is False.
extra_features: list of columns (and possibly functions) used for extra features calculation, e.g. country if name_only=False. default is None. With name_only=False, internally extra_features=['country'].
fit_kws: extra kwargs passed on to model fit function. optional.
- Returns:
self reference (object including the trained supervised model)
- Parameters:
train_positive_names_to_match (Optional[DataFrame])
create_negative_sample_fraction (float)
n_train_ids (int)
random_seed (int)
train_gt (Optional[DataFrame])
drop_duplicate_candidates (Optional[bool])
extra_features (Optional[list[str | tuple[str, Callable]]])
- Return type:
PandasEntityMatching
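The drop_duplicate_candidates option described for fit_classifier() ("keep just one, if available keep the correct match") can be sketched as follows. The record layout (dicts with name, gt_name, gt_id and correct keys) and the definition of a duplicate as an identical (name, gt_name) pair are assumptions for this example, not emm's actual column names or logic.

```python
from typing import Dict, List, Tuple


def drop_duplicate_candidates(candidates: List[dict]) -> List[dict]:
    """Keep one candidate per (name, gt_name) pair, preferring a correct match.

    Hypothetical helper; keys and duplicate definition are assumed.
    """
    kept: Dict[Tuple[str, str], dict] = {}
    for cand in candidates:
        key = (cand["name"], cand["gt_name"])
        current = kept.get(key)
        # Keep the first occurrence, unless a later one is the correct match.
        if current is None or (cand["correct"] and not current["correct"]):
            kept[key] = cand
    return list(kept.values())


candidates = [
    {"name": "acme bv", "gt_name": "acme", "gt_id": 1, "correct": False},
    {"name": "acme bv", "gt_name": "acme", "gt_id": 2, "correct": True},
    {"name": "acme bv", "gt_name": "acme corp", "gt_id": 3, "correct": False},
]
deduplicated = drop_duplicate_candidates(candidates)
```

Duplicate pairs carry identical string-similarity features, so keeping only one of them (the correct one when present) avoids feeding the classifier the same features with conflicting labels.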
- increase_window_by_one_step()
Utility function for negative sample creation during training
This changes the parameter settings of the fitted model.
- initialize()
If you updated parameters of EntityMatching, you might want to initialize again.
- static load(emo_path, load_func=<function load_joblib>, override_parameters=None, name_col=None, entity_id_col=None, **kwargs)
Load the EMM object.
Below are the most common arguments. For the complete list see emm.parameters.MODEL_PARAMS. These arguments are optional and update the parameters dictionary.
- Args:
emo_path: path to the EMM pickle file.
load_func: function used for loading object. default is joblib.load()
override_parameters: parameters that overwrite the settings of the EMM object. optional.
name_col: name column in dataframe. default is “name”.
entity_id_col: id column in dataframe. default is “id”.
kwargs: extra key-word arguments are passed on to parameters dictionary.
- Returns:
instantiated EMM object
- Examples:
>>> # deserialize pickled EMM object and rename name column
>>> em = PandasEntityMatching.load(emo_path, name_col='Name', entity_id_col='Id')
- Parameters:
emo_path (str)
load_func (Callable)
override_parameters (Optional[Mapping[str, Any]])
name_col (Optional[str])
entity_id_col (Optional[str])
- Return type:
object
- save(emo_path, dump_func=functools.partial(<function dump>, compress=True))
Serialize the EMM object.
- Args:
emo_path: path to the EMM pickle file.
dump_func: function used for dumping self. default is joblib.dump() with compression turned on.
- Parameters:
emo_path (str)
dump_func (Callable)
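The dump_func and load_func arguments make the serialization backend pluggable. The sketch below shows that pattern with the standard-library pickle module in place of joblib, and a plain dict standing in for the EMM object. The helpers save_object, load_object, dump_pickle and load_pickle are hypothetical, and the call convention dump_func(obj, path) is an assumption modelled on joblib.dump(value, filename).

```python
import functools
import os
import pickle
import tempfile


def dump_pickle(obj, path, protocol=pickle.DEFAULT_PROTOCOL):
    """Write obj to path with pickle (stand-in for joblib.dump)."""
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=protocol)


def load_pickle(path):
    """Read an object back from path (stand-in for joblib.load)."""
    with open(path, "rb") as f:
        return pickle.load(f)


def save_object(
    obj,
    path,
    dump_func=functools.partial(dump_pickle, protocol=pickle.HIGHEST_PROTOCOL),
):
    # functools.partial pre-binds backend options, so the caller only
    # needs to know the two-argument convention dump_func(obj, path).
    dump_func(obj, path)


def load_object(path, load_func=load_pickle, override_parameters=None):
    obj = load_func(path)
    if override_parameters:
        # Overrides are applied on top of the stored settings.
        obj.update(override_parameters)
    return obj


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    save_object({"name_col": "name", "entity_id_col": "id"}, path)
    restored = load_object(path, override_parameters={"name_col": "Name"})
```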
- set_return_sm_features(return_features=True)
Toggle setting to return supervised model features
- Args:
return_features: bool to return supervised model features, default is True.
- test_classifier(test_names_to_match, test_gt=None)
Helper function for testing the supervised model.
Print multiple ML model metrics.
- Args:
test_names_to_match: test dataframe with names (and ids) to match.
test_gt: provide alternative GT. optional, default is None.
- Parameters:
test_names_to_match (DataFrame)
test_gt (Optional[DataFrame])
- transform(names_df, top_n=-1)
Matches given names against ground truth.
transform() returns a pandas dataframe with name-pair candidates.
- Args:
names_df: dataframe or series with names to be matched.
top_n: return top-n candidates per name to match, top-n > 0. -1 returns all candidates. default is -1.
- Returns:
dataframe with candidate name-pairs
- Parameters:
names_df (DataFrame | Series)
top_n (int)
- Return type:
DataFrame
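The meaning of top_n ("top-n candidates per name to match", with -1 returning all) can be illustrated with a small stand-alone sketch. Candidates are represented here as (name, ground-truth name, score) tuples and ranked by score; both the tuple layout and ranking by a single score are assumptions for this example, not the columns or ordering used by transform().

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Candidate = Tuple[str, str, float]  # (name, ground-truth name, score)


def keep_top_n(candidates: List[Candidate], top_n: int = -1) -> List[Candidate]:
    """Keep the top_n highest-scoring candidates per name; -1 keeps all."""
    if top_n == -1:
        return list(candidates)
    if top_n <= 0:
        raise ValueError("top_n must be > 0, or -1 to keep all candidates")
    # Group candidates by the name to match (dicts preserve insertion order).
    per_name: Dict[str, List[Candidate]] = defaultdict(list)
    for cand in candidates:
        per_name[cand[0]].append(cand)
    result: List[Candidate] = []
    for name_candidates in per_name.values():
        ranked = sorted(name_candidates, key=lambda c: c[2], reverse=True)
        result.extend(ranked[:top_n])
    return result


candidates = [
    ("acme bv", "acme", 0.9),
    ("acme bv", "acme corp", 0.8),
    ("acme bv", "apex", 0.2),
    ("globex", "globex inc", 0.7),
]
top_two = keep_top_n(candidates, top_n=2)
```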
emm.pipeline.spark_entity_matching module
Module contents
- class emm.pipeline.PandasEntityMatching(parameters=None, supervised_models=None, name_col=None, entity_id_col=None, name_only=None, preprocessor=None, indexers=None, supervised_on=None, without_rank_features=None, with_legal_entity_forms_match=None, return_sm_features=None, supervised_model_object=None, aggregation_layer=None, aggregation_method=None, carry_on_cols=None, **kwargs)
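Bases: BaseEntityMatching
Implementation of EntityMatching using Pandas. The parameters and methods are identical to those documented under emm.pipeline.pandas_entity_matching.PandasEntityMatching.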