emm package¶
Subpackages¶
- emm.aggregation package
- emm.base package
- emm.data package
- emm.features package
- emm.helper package
- emm.indexing package
- Submodules
- emm.indexing.base_indexer module
- emm.indexing.pandas_candidate_selection module
- emm.indexing.pandas_cos_sim_matcher module
- emm.indexing.pandas_naive_indexer module
- emm.indexing.pandas_normalized_tfidf module
- emm.indexing.pandas_sni module
- emm.indexing.spark_candidate_selection module
- emm.indexing.spark_character_tokenizer module
- emm.indexing.spark_cos_sim_matcher module
- emm.indexing.spark_indexing_utils module
- emm.indexing.spark_normalized_tfidf module
- emm.indexing.spark_sni module
- emm.indexing.spark_word_tokenizer module
- Module contents
- emm.loggers package
- emm.pipeline package
- Submodules
- emm.pipeline.base_entity_matching module
- emm.pipeline.pandas_entity_matching module
PandasEntityMatching
PandasEntityMatching.add_aggregation_layer()
PandasEntityMatching.add_supervised_model()
PandasEntityMatching.create_training_name_pairs()
PandasEntityMatching.decrease_window_by_one_step()
PandasEntityMatching.fit()
PandasEntityMatching.fit_classifier()
PandasEntityMatching.increase_window_by_one_step()
PandasEntityMatching.initialize()
PandasEntityMatching.load()
PandasEntityMatching.save()
PandasEntityMatching.set_return_sm_features()
PandasEntityMatching.test_classifier()
PandasEntityMatching.transform()
- emm.pipeline.spark_entity_matching module
- Module contents
PandasEntityMatching
PandasEntityMatching.add_aggregation_layer()
PandasEntityMatching.add_supervised_model()
PandasEntityMatching.create_training_name_pairs()
PandasEntityMatching.decrease_window_by_one_step()
PandasEntityMatching.fit()
PandasEntityMatching.fit_classifier()
PandasEntityMatching.increase_window_by_one_step()
PandasEntityMatching.initialize()
PandasEntityMatching.load()
PandasEntityMatching.save()
PandasEntityMatching.set_return_sm_features()
PandasEntityMatching.test_classifier()
PandasEntityMatching.transform()
- emm.preprocessing package
- Submodules
- emm.preprocessing.abbreviation_util module
- emm.preprocessing.base_name_preprocessor module
- emm.preprocessing.functions module
- emm.preprocessing.pandas_functions module
- emm.preprocessing.pandas_preprocessor module
- emm.preprocessing.spark_functions module
- emm.preprocessing.spark_preprocessor module
- Module contents
- emm.supervised_model package
- emm.threshold package
Submodules¶
emm.parameters module¶
Default parameters for Entity Matching.
emm.resources module¶
- emm.resources.data(name)¶
Return the full path filename of a shipped data file.
- Args:
name: The name of the data file.
- Returns:
The full path filename of the data file.
- Raises:
FileNotFoundError: If the data file cannot be found.
- Parameters:
name (str)
- Return type:
str
- emm.resources.notebook(name)¶
Return the full path filename of a tutorial notebook.
- Args:
name: The name of the notebook.
- Returns:
The full path filename of the notebook.
- Raises:
FileNotFoundError: If the notebook cannot be found.
- Parameters:
name (str)
- Return type:
str
emm.version module¶
Module contents¶
- class emm.PandasEntityMatching(parameters=None, supervised_models=None, name_col=None, entity_id_col=None, name_only=None, preprocessor=None, indexers=None, supervised_on=None, without_rank_features=None, with_legal_entity_forms_match=None, return_sm_features=None, supervised_model_object=None, aggregation_layer=None, aggregation_method=None, carry_on_cols=None, **kwargs)¶
Bases:
BaseEntityMatching
Implementation of EntityMatching using Pandas.
- Parameters:
parameters (Optional[dict[str, Any]])
supervised_models (Optional[Mapping[str, Any]])
name_col (Optional[str])
entity_id_col (Optional[str])
name_only (Optional[bool])
preprocessor (Optional[str])
indexers (Optional[list])
supervised_on (Optional[bool])
without_rank_features (Optional[bool])
with_legal_entity_forms_match (Optional[bool])
return_sm_features (Optional[bool])
supervised_model_object (Optional[Pipeline])
aggregation_layer (Optional[bool])
aggregation_method (Optional[Literal['mean_score', 'max_frequency_nm_score']])
carry_on_cols (Optional[list[str]])
- add_aggregation_layer(account_col=None, freq_col=None, aggregation_method=None, blacklist=None, aggregation_layer=None)¶
Add or replace the aggregation layer in the pipeline.
- Args:
account_col: column that indicates which names-to-match belong together. Default is “account”.
freq_col: name frequency column. Default is “counterparty_account_count_distinct”.
aggregation_method: aggregation method, ‘name_clustering’ or ‘mean_score’. Default is ‘name_clustering’.
blacklist: blacklist of names to skip in clustering.
aggregation_layer: existing aggregation layer to add. Default is None; if so, one is created.
- Parameters:
account_col (Optional[str])
freq_col (Optional[str])
aggregation_method (Optional[str])
blacklist (Optional[list])
aggregation_layer (Optional[BaseEntityAggregation])
- Return type:
None
- add_supervised_model(path=None, model=None, name_only=True, store_key='nm_score', overwrite=True, return_features=None)¶
Add a trained sklearn supervised model to the existing pipeline.
- Args:
path: file path of the pickled sklearn pipeline. Alternatively, provide the model directly.
model: trained sklearn pipeline to add to the supervised layer.
name_only: name-only model? If False, the presence of extra features (e.g. country) is checked. Default is True.
store_key: storage key for the new sklearn supervised model. Default is ‘nm_score’.
overwrite: overwrite an existing model if store_key is already used. Default is True.
return_features: bool to return supervised model features. None means the default: False.
- Parameters:
path (Optional[str])
model (Optional[Pipeline])
name_only (bool)
store_key (str)
overwrite (bool)
return_features (Optional[bool])
- Return type:
None
- create_training_name_pairs(train_positive_names_to_match, create_negative_sample_fraction=0, n_train_ids=-1, random_seed=42, drop_duplicate_candidates=None, **kwargs)¶
Create name-pairs for training from positive names that match to the ground truth.
Positive names are names that are supposed to match to the ground truth. A fraction of the positive names can be converted to negative names, which are not supposed to match to the ground truth.
- Args:
train_positive_names_to_match: pandas dataframe of positive names to match for training. A positive name has a guaranteed match to a name in the ground truth. Two columns are needed: a name and an id (to determine the corresponding match to the ground truth).
create_negative_sample_fraction: fraction of ids converted to negative names. A negative name has guaranteed no match to any name in the ground truth. Default is 0: no negative names are created.
n_train_ids: down-sample the positive names to match, keeping only n_train_ids ids. Default is -1 (keep all).
random_seed: random seed for down-sampling of ids. Default is 42.
drop_duplicate_candidates: if True, drop any duplicate training candidates and keep just one; if available, keep the correct match. Recommended for string-similarity models, e.g. with without_rank_features=True. Default is False.
kwargs: extra keyword arguments passed on to prepare_name_pairs_pd.
- Returns:
pandas dataframe with name-pair candidates to be used for training.
- Parameters:
train_positive_names_to_match (DataFrame)
create_negative_sample_fraction (float)
n_train_ids (int)
random_seed (int)
drop_duplicate_candidates (Optional[bool])
- Return type:
DataFrame
- decrease_window_by_one_step()¶
Utility function for negative sample creation during training.
This changes the parameter settings of the fitted model.
- fit(ground_truth_df, copy_ground_truth=False)¶
Fit the name indexers on ground truth data.
Fit excludes the supervised model, which needs a training list of names that match to the ground truth. See instead: cls.fit_classifier().
- Args:
ground_truth_df: pandas dataframe with ground truth names and corresponding ids.
copy_ground_truth: if True, keep a copy of the ground truth, useful for storage of the model.
- Returns:
self reference (for compatibility with sklearn models)
- Parameters:
ground_truth_df (DataFrame)
copy_ground_truth (bool)
- Return type:
PandasEntityMatching
- fit_classifier(train_positive_names_to_match=None, train_name_pairs=None, create_negative_sample_fraction=0, n_train_ids=-1, random_seed=42, train_gt=None, store_key='nm_score', train_function=<function train_model>, score_columns=None, drop_duplicate_candidates=None, extra_features=None, **fit_kws)¶
Train the supervised model based on positive input names.
Positive names are names that are supposed to match to the ground truth. A fraction of the positive names can be converted to negative names, which are not supposed to match to the ground truth.
- Args:
train_positive_names_to_match: pandas dataframe of positive names to match for training. A positive name has a guaranteed match to a name in the ground truth. Two columns are needed: a name and an id (to determine the corresponding match to the ground truth).
train_name_pairs: pandas dataframe with training name-pair candidates, an alternative to train_positive_names_to_match. When not provided, train name pairs are created from the positive names to match using self.create_training_name_pairs(). Default is None (optional).
create_negative_sample_fraction: fraction of ids converted to negative names. A negative name has guaranteed no match to any name in the ground truth. Default is 0: no negative names are created.
n_train_ids: down-sample the positive names to match, keeping only n_train_ids ids. Default is -1 (keep all).
random_seed: random seed for down-sampling of ids. Default is 42.
train_gt: pandas dataframe of ground truth names and ids for training the indexers. By default we assume the indexers have already been fit. Default is None (optional).
store_key: storage key for the new supervised model. Default is ‘nm_score’.
train_function: provide a custom function to create and train the model pipeline. Optional.
score_columns: list of columns with raw scores from indexers to pass to the classifier. Default is None, meaning all indexer scores (e.g. cosine similarity values).
drop_duplicate_candidates: if True, drop any duplicate training candidates and keep just one; if available, keep the correct match. Recommended for string-similarity models, e.g. with without_rank_features=True. Default is False.
extra_features: list of columns (and possibly functions) used for extra-features calculation, e.g. country if name_only=False. Default is None. With name_only=False, internally extra_features=['country'] is used.
fit_kws: extra kwargs passed on to the model fit function. Optional.
- Returns:
self reference (object including the trained supervised model)
- Parameters:
train_positive_names_to_match (Optional[DataFrame])
create_negative_sample_fraction (float)
n_train_ids (int)
random_seed (int)
train_gt (Optional[DataFrame])
drop_duplicate_candidates (Optional[bool])
extra_features (Optional[list[str | tuple[str, Callable]]])
- Return type:
PandasEntityMatching
- increase_window_by_one_step()¶
Utility function for negative sample creation during training.
This changes the parameter settings of the fitted model.
- initialize()¶
If you have updated the parameters of EntityMatching, you may want to initialize again.
- static load(emo_path, load_func=<function load_joblib>, override_parameters=None, name_col=None, entity_id_col=None, **kwargs)¶
Load the EMM object.
Below are the most common arguments. For the complete list see emm.parameters.MODEL_PARAMS. These arguments are optional and update the parameters dictionary.
- Args:
emo_path: path to the EMM pickle file.
load_func: function used for loading the object. Default is joblib.load().
override_parameters: parameters that overwrite the settings of the EMM object. Optional.
name_col: name column in the dataframe. Default is “name”.
entity_id_col: id column in the dataframe. Default is “id”.
kwargs: extra keyword arguments passed on to the parameters dictionary.
- Returns:
instantiated EMM object
- Examples:
>>> # deserialize pickled EMM object and rename name column
>>> em = PandasEntityMatching.load(emo_path, name_col='Name', entity_id_col='Id')
- Parameters:
emo_path (str)
load_func (Callable)
override_parameters (Optional[Mapping[str, Any]])
name_col (Optional[str])
entity_id_col (Optional[str])
- Return type:
object
- save(emo_path, dump_func=functools.partial(<function dump>, compress=True))¶
Serialize the EMM object.
- Args:
emo_path: path to the EMM pickle file.
dump_func: function used for dumping self. Default is joblib.dump() with compression turned on.
- Parameters:
emo_path (str)
dump_func (Callable)
- set_return_sm_features(return_features=True)¶
Toggle setting to return supervised model features
- Args:
return_features: bool to return supervised model features, default is True.
- test_classifier(test_names_to_match, test_gt=None)¶
Helper function for testing the supervised model.
Prints multiple ML model metrics.
- Args:
test_names_to_match: test dataframe with names (and ids) to match.
test_gt: provide an alternative ground truth. Optional, default is None.
- Parameters:
test_names_to_match (DataFrame)
test_gt (Optional[DataFrame])
- transform(names_df, top_n=-1)¶
Matches given names against ground truth.
transform() returns a pandas dataframe with name-pair candidates.
- Args:
names_df: dataframe or series with names to be matched.
top_n: return the top-n candidates per name to match (top-n > 0); -1 returns all candidates. Default is -1.
- Returns:
dataframe with candidate name-pairs
- Parameters:
names_df (DataFrame | Series)
top_n (int)
- Return type:
DataFrame
- emm.set_logger(level=20, format='%(asctime)s %(name)-12s %(levelname)-8s %(message)s')¶