emm.aggregation package
Submodules
emm.aggregation.base_entity_aggregation module
- class emm.aggregation.base_entity_aggregation.BaseEntityAggregation(score_col, account_col='account', index_col='entity_id', gt_entity_id_col='gt_entity_id', uid_col='uid', gt_uid_col='gt_uid', name_col='name_col', freq_col='counterparty_account_count_distinct', output_col='agg_score', preprocessed_col='preprocessed', gt_name_col='gt_name', gt_preprocessed_col='gt_preprocessed', correct_col='correct', aggregation_method='max_frequency_nm_score', blacklist=None, positive_set_col='positive_set')
Bases: Pipeline
- Parameters:
score_col (str)
account_col (str)
index_col (str)
gt_entity_id_col (str)
uid_col (str)
gt_uid_col (str)
name_col (str)
freq_col (str)
output_col (str)
preprocessed_col (str)
gt_name_col (str)
gt_preprocessed_col (str)
correct_col (str)
aggregation_method (Literal['max_frequency_nm_score', 'mean_score'])
blacklist (Optional[list])
positive_set_col (str)
- get_group(dataframe)
- Return type:
list[str]
- get_gt_group()
- Return type:
list[str]
- abstract remove_blacklisted_names(df, preprocessed_col)
- Parameters:
df (Any)
preprocessed_col (str)
- Return type:
Any
- emm.aggregation.base_entity_aggregation.is_series_unique(series)
- Parameters:
series (Series)
- Return type:
bool
- emm.aggregation.base_entity_aggregation.matching_max_candidate(df, group, score_col, name_col, account_col, freq_col, output_col, aggregation_method='max_frequency_nm_score')
Aggregates all the names of an account together with their candidates. If aggregation_method = 'mean_score', the scores are averaged per ground-truth (GT) entity and the maximum average is returned.
Returns a dataframe with a single row.
- Args:
df: Pandas DataFrame containing all the names of an account
group: Grouping columns used for calculating agg_score, usually or (gt_entity_id, gt_uid)
score_col: Score column on which the aggregation is performed
name_col: Name column used for name clustering
account_col: Account column used for name clustering
freq_col: Frequency column used for the name clustering and weighted averages
output_col: Name of the column in which to store the final score
aggregation_method: Aggregation method to use: name_clustering, mean_score, or max_frequency_nm_score
- Parameters:
df (DataFrame)
group (list[str])
score_col (str)
name_col (str)
account_col (str)
freq_col (str)
output_col (str)
aggregation_method (Literal['max_frequency_nm_score', 'mean_score'])
- Return type:
DataFrame
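The 'mean_score' rule described above (average the scores per ground-truth entity, then keep the maximum) can be illustrated with a minimal sketch. This is not the emm implementation: it uses plain Python dictionaries instead of a DataFrame to stay dependency-free, and the function name `mean_score_best_match`, the `nm_score` key, and the sample rows are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_score_best_match(rows, group_key="gt_entity_id", score_key="nm_score"):
    """Average the scores per ground-truth entity and return the best one.

    Hypothetical helper for illustration only; not part of emm.
    """
    scores_per_gt = defaultdict(list)
    for row in rows:
        scores_per_gt[row[group_key]].append(row[score_key])

    # One mean score per ground-truth entity.
    means = {gt: mean(scores) for gt, scores in scores_per_gt.items()}

    # The aggregated result is the entity with the highest mean score.
    best_gt = max(means, key=means.get)
    return best_gt, means[best_gt]


# All names of one account, each scored against a ground-truth candidate.
candidates = [
    {"name": "Acme Corp", "gt_entity_id": 1, "nm_score": 0.9},
    {"name": "Acme Corporation", "gt_entity_id": 1, "nm_score": 0.7},
    {"name": "Acme Co", "gt_entity_id": 2, "nm_score": 0.85},
]

best_gt, agg_score = mean_score_best_match(candidates)
```

In this sample, entity 1 has the single highest score (0.9), but its mean is about 0.8, so entity 2 (mean 0.85) is selected. That is the practical difference between averaging per entity and simply taking the top-scoring row.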
emm.aggregation.pandas_entity_aggregation module
- class emm.aggregation.pandas_entity_aggregation.PandasEntityAggregation(score_col, account_col='account', index_col='entity_id', gt_entity_id_col='gt_entity_id', uid_col='uid', gt_uid_col='gt_uid', name_col='name', freq_col='counterparty_account_count_distinct', output_col='agg_score', preprocessed_col='preprocessed', gt_name_col='gt_name', gt_preprocessed_col='gt_preprocessed', correct_col='correct', aggregation_method='max_frequency_nm_score', blacklist=None)
Bases: TransformerMixin, BaseEntityAggregation
Pandas name-matching aggregation code
- Parameters:
score_col (str)
account_col (str)
index_col (str)
gt_entity_id_col (str)
uid_col (str)
gt_uid_col (str)
name_col (str)
freq_col (str)
output_col (str)
preprocessed_col (str)
gt_name_col (str)
gt_preprocessed_col (str)
correct_col (str)
aggregation_method (Literal['max_frequency_nm_score', 'mean_score'])
blacklist (Optional[list[str]])
- fit(X, y=None)
Dummy function, no fitting is required.
- Parameters:
X (DataFrame)
y (Optional[Series])
- Return type:
TransformerMixin
- fit_transform(X, y=None)
Only calls transform(), no fitting required.
- Parameters:
X (DataFrame)
y (Optional[Series])
- Return type:
DataFrame
- remove_blacklisted_names(df, preprocessed_col='preprocessed')
- Parameters:
df (DataFrame)
preprocessed_col (str)
- transform(X)
Combine scores of a group of name-pair candidates that belong together.
Match a group of company names that belong together to a company name in the ground truth.
- Args:
X: dataframe of scored candidates
- Returns:
dataframe of scored candidates, only one row per account
- Parameters:
X (DataFrame)
- Return type:
DataFrame | None
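The method contract documented above (a fit() that learns nothing, a fit_transform() that only calls transform(), and a transform() that returns one row per account) can be sketched as a small stateless transformer. This is an illustration under stated assumptions, not the emm code: the class `OneRowPerAccount` is hypothetical, it works on lists of dictionaries rather than DataFrames, it assumes fit() returns the instance itself, and it simply keeps the highest-scoring row per account instead of applying either of the library's aggregation methods.

```python
class OneRowPerAccount:
    """Hypothetical stateless transformer; not the emm implementation."""

    def __init__(self, account_col="account", score_col="agg_score"):
        self.account_col = account_col
        self.score_col = score_col

    def fit(self, X, y=None):
        # Nothing is learned from the data, so fit just returns the instance.
        return self

    def transform(self, X):
        # Keep only the highest-scoring row for each account.
        best = {}
        for row in X:
            account = row[self.account_col]
            current = best.get(account)
            if current is None or row[self.score_col] > current[self.score_col]:
                best[account] = row
        return list(best.values())

    def fit_transform(self, X, y=None):
        # There is no fitting step, so this is equivalent to transform().
        return self.transform(X)


rows = [
    {"account": "A1", "gt_entity_id": 1, "agg_score": 0.6},
    {"account": "A1", "gt_entity_id": 2, "agg_score": 0.8},
    {"account": "A2", "gt_entity_id": 3, "agg_score": 0.4},
]

agg = OneRowPerAccount()
result = agg.fit_transform(rows)
```

Because the transformer holds no fitted state, calling transform() directly gives the same result as fit_transform(), which is why such a step can be dropped into a pipeline without a training pass.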
emm.aggregation.spark_entity_aggregation module
Module contents
- class emm.aggregation.PandasEntityAggregation(score_col, account_col='account', index_col='entity_id', gt_entity_id_col='gt_entity_id', uid_col='uid', gt_uid_col='gt_uid', name_col='name', freq_col='counterparty_account_count_distinct', output_col='agg_score', preprocessed_col='preprocessed', gt_name_col='gt_name', gt_preprocessed_col='gt_preprocessed', correct_col='correct', aggregation_method='max_frequency_nm_score', blacklist=None)
Bases: TransformerMixin, BaseEntityAggregation
Pandas name-matching aggregation code
- Parameters:
score_col (str)
account_col (str)
index_col (str)
gt_entity_id_col (str)
uid_col (str)
gt_uid_col (str)
name_col (str)
freq_col (str)
output_col (str)
preprocessed_col (str)
gt_name_col (str)
gt_preprocessed_col (str)
correct_col (str)
aggregation_method (Literal['max_frequency_nm_score', 'mean_score'])
blacklist (Optional[list[str]])
- fit(X, y=None)
Dummy function, no fitting is required.
- Parameters:
X (DataFrame)
y (Optional[Series])
- Return type:
TransformerMixin
- fit_transform(X, y=None)
Only calls transform(), no fitting required.
- Parameters:
X (DataFrame)
y (Optional[Series])
- Return type:
DataFrame
- remove_blacklisted_names(df, preprocessed_col='preprocessed')
- Parameters:
df (DataFrame)
preprocessed_col (str)
- transform(X)
Combine scores of a group of name-pair candidates that belong together.
Match a group of company names that belong together to a company name in the ground truth.
- Args:
X: dataframe of scored candidates
- Returns:
dataframe of scored candidates, only one row per account
- Parameters:
X (DataFrame)
- Return type:
DataFrame | None