emm.aggregation package

Submodules

emm.aggregation.base_entity_aggregation module

class emm.aggregation.base_entity_aggregation.BaseEntityAggregation(score_col, account_col='account', index_col='entity_id', gt_entity_id_col='gt_entity_id', uid_col='uid', gt_uid_col='gt_uid', name_col='name_col', freq_col='counterparty_account_count_distinct', output_col='agg_score', preprocessed_col='preprocessed', gt_name_col='gt_name', gt_preprocessed_col='gt_preprocessed', correct_col='correct', aggregation_method='max_frequency_nm_score', blacklist=None, positive_set_col='positive_set')

Bases: Pipeline

Parameters:

score_col (str)
account_col (str)
index_col (str)
gt_entity_id_col (str)
uid_col (str)
gt_uid_col (str)
name_col (str)
freq_col (str)
output_col (str)
preprocessed_col (str)
gt_name_col (str)
gt_preprocessed_col (str)
correct_col (str)
aggregation_method (Literal['max_frequency_nm_score', 'mean_score'])
blacklist (Optional[list])
positive_set_col (str)

get_group(dataframe)

Return type: list[str]

get_gt_group()

Return type: list[str]

abstract remove_blacklisted_names(df, preprocessed_col)

Parameters:

df (Any)
preprocessed_col (str)

Return type:

Any

emm.aggregation.base_entity_aggregation.is_series_unique(series)

Parameters: series (Series)
Return type: bool

emm.aggregation.base_entity_aggregation.matching_max_candidate(df, group, score_col, name_col, account_col, freq_col, output_col, aggregation_method='max_frequency_nm_score')

Aggregates all the names of an account together with their candidates. If aggregation_method='mean_score', the scores are averaged per ground-truth (GT) candidate and the maximum is returned.

Returns a dataframe with a single row.

Args:

df: Pandas DataFrame containing all the names of an account
group: Grouping columns used for calculating agg_score, for example (gt_entity_id, gt_uid)
score_col: Score column on which the aggregation is performed
name_col: Name column used for name clustering
account_col: Account column used for name clustering
freq_col: Frequency column used for the name clustering and weighted averages
output_col: Name of the column in which to store the final score
aggregation_method: Aggregation method to use: mean_score or max_frequency_nm_score

Parameters:

df (DataFrame)
group (list[str])
score_col (str)
name_col (str)
account_col (str)
freq_col (str)
output_col (str)
aggregation_method (Literal['max_frequency_nm_score', 'mean_score'])

Return type:

DataFrame
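
The 'mean_score' behaviour described above (average the scores per ground-truth candidate, then return the maximum) can be illustrated with a minimal, standard-library sketch. It is not the library's pandas implementation: the helper name mean_score_best_candidate, the list-of-dicts input, and the "score" column name are assumptions made for illustration only.

```python
from collections import defaultdict
from statistics import mean


def mean_score_best_candidate(rows, group, score_col="score", output_col="agg_score"):
    """Hypothetical helper: average score_col per group key and return the best group as one row."""
    scores_by_group = defaultdict(list)
    for row in rows:
        # The group key identifies one ground-truth candidate.
        key = tuple(row[col] for col in group)
        scores_by_group[key].append(row[score_col])

    # Mean score per candidate, then keep the candidate with the highest mean.
    best_key, best_scores = max(scores_by_group.items(), key=lambda item: mean(item[1]))
    result = dict(zip(group, best_key))
    result[output_col] = mean(best_scores)
    return result


# All names of a single account, each scored against a ground-truth candidate.
rows = [
    {"name": "Acme Corp", "gt_entity_id": 1, "score": 0.9},
    {"name": "Acme Corporation", "gt_entity_id": 1, "score": 0.7},
    {"name": "Acme Corp", "gt_entity_id": 2, "score": 0.85},
]
best = mean_score_best_candidate(rows, group=["gt_entity_id"])
```

In this sketch the candidate with the single highest score (0.9) does not win, because its mean over both names is lower than the mean of the other candidate.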

emm.aggregation.pandas_entity_aggregation module

class emm.aggregation.pandas_entity_aggregation.PandasEntityAggregation(score_col, account_col='account', index_col='entity_id', gt_entity_id_col='gt_entity_id', uid_col='uid', gt_uid_col='gt_uid', name_col='name', freq_col='counterparty_account_count_distinct', output_col='agg_score', preprocessed_col='preprocessed', gt_name_col='gt_name', gt_preprocessed_col='gt_preprocessed', correct_col='correct', aggregation_method='max_frequency_nm_score', blacklist=None)

Bases: TransformerMixin, BaseEntityAggregation

Pandas name-matching aggregation code

Parameters:

score_col (str)
account_col (str)
index_col (str)
gt_entity_id_col (str)
uid_col (str)
gt_uid_col (str)
name_col (str)
freq_col (str)
output_col (str)
preprocessed_col (str)
gt_name_col (str)
gt_preprocessed_col (str)
correct_col (str)
aggregation_method (Literal['max_frequency_nm_score', 'mean_score'])
blacklist (Optional[list[str]])

fit(X, y=None)

Dummy function, no fitting is required.

Parameters:

X (DataFrame)
y (Optional[Series])

Return type:

TransformerMixin

fit_transform(X, y=None)

Only calls transform(); no fitting is required.

Parameters:

X (DataFrame)
y (Optional[Series])

Return type:

DataFrame
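
The no-op fit() and the fit_transform() that only delegates to transform() follow a common pattern for stateless transformers. The sketch below shows that pattern in plain Python; StatelessAggregator is a hypothetical stand-in, not the emm class, and its transform() is a placeholder.

```python
class StatelessAggregator:
    """Hypothetical stand-in showing the no-op fit pattern; not the emm class."""

    def fit(self, X, y=None):
        # Nothing is learned from X, so fit only returns the instance.
        return self

    def transform(self, X):
        # Placeholder transformation for illustration: keep the highest score.
        return max(X)

    def fit_transform(self, X, y=None):
        # There is no fitting step, so delegate directly to transform().
        return self.transform(X)


agg = StatelessAggregator()
scores = [0.2, 0.9, 0.5]
result = agg.fit_transform(scores)
```

Because fit() returns the instance itself, calls can still be chained in the usual fit(...).transform(...) style.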

remove_blacklisted_names(df, preprocessed_col='preprocessed')

Parameters:

df (DataFrame)
preprocessed_col (str)
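
As a rough illustration of blacklist filtering, the following standard-library sketch drops rows whose preprocessed name appears in a blacklist. It assumes exact matching on the preprocessed name, which this reference does not confirm; drop_blacklisted and the list-of-dicts input are hypothetical and are not part of emm.

```python
def drop_blacklisted(rows, blacklist, preprocessed_col="preprocessed"):
    """Hypothetical helper: keep rows whose preprocessed name is not blacklisted."""
    # Assumption: a name is blacklisted only on an exact match; None means no blacklist.
    blocked = set(blacklist or [])
    return [row for row in rows if row[preprocessed_col] not in blocked]


rows = [
    {"name": "ACME Corp.", "preprocessed": "acme corp"},
    {"name": "Unknown", "preprocessed": "unknown"},
]
kept = drop_blacklisted(rows, blacklist=["unknown"])
unchanged = drop_blacklisted(rows, blacklist=None)
```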

transform(X)

Combine scores of a group of name-pair candidates that belong together.

Match a group of company names that belong together to a company name in the ground truth.

Args: X: dataframe of scored candidates
Returns: dataframe of scored candidates, only one row per account

Parameters: X (DataFrame)
Return type: DataFrame | None
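
To make the output shape concrete (scored candidates in, one row per account out), here is a minimal standard-library sketch. It is not the pandas implementation: one_row_per_account, the "score" column, and the use of the mean to combine scores are assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import mean


def one_row_per_account(rows, account_col="account", gt_col="gt_entity_id",
                        score_col="score", output_col="agg_score"):
    """Hypothetical helper: reduce scored candidates to the best candidate per account."""
    # Collect scores per account and, within each account, per ground-truth candidate.
    scores = defaultdict(lambda: defaultdict(list))
    for row in rows:
        scores[row[account_col]][row[gt_col]].append(row[score_col])

    result = []
    for account, per_gt in scores.items():
        # Combine the scores of each candidate (here: the mean), then keep the best one.
        gt_id, agg = max(
            ((gt, mean(values)) for gt, values in per_gt.items()),
            key=lambda pair: pair[1],
        )
        result.append({account_col: account, gt_col: gt_id, output_col: agg})
    return result


rows = [
    {"account": "A1", "gt_entity_id": 10, "score": 0.8},
    {"account": "A1", "gt_entity_id": 10, "score": 0.6},
    {"account": "A1", "gt_entity_id": 11, "score": 0.5},
    {"account": "A2", "gt_entity_id": 12, "score": 0.4},
]
aggregated = one_row_per_account(rows)
```

Four candidate rows spread over two accounts are reduced to two rows, one for each account.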

emm.aggregation.spark_entity_aggregation module

Module contents

class emm.aggregation.PandasEntityAggregation(score_col, account_col='account', index_col='entity_id', gt_entity_id_col='gt_entity_id', uid_col='uid', gt_uid_col='gt_uid', name_col='name', freq_col='counterparty_account_count_distinct', output_col='agg_score', preprocessed_col='preprocessed', gt_name_col='gt_name', gt_preprocessed_col='gt_preprocessed', correct_col='correct', aggregation_method='max_frequency_nm_score', blacklist=None)

Bases: TransformerMixin, BaseEntityAggregation

Pandas name-matching aggregation code

Parameters:

score_col (str)
account_col (str)
index_col (str)
gt_entity_id_col (str)
uid_col (str)
gt_uid_col (str)
name_col (str)
freq_col (str)
output_col (str)
preprocessed_col (str)
gt_name_col (str)
gt_preprocessed_col (str)
correct_col (str)
aggregation_method (Literal['max_frequency_nm_score', 'mean_score'])
blacklist (Optional[list[str]])

fit(X, y=None)

Dummy function, no fitting is required.

Parameters:

X (DataFrame)
y (Optional[Series])

Return type:

TransformerMixin

fit_transform(X, y=None)

Only calls transform(); no fitting is required.

Parameters:

X (DataFrame)
y (Optional[Series])

Return type:

DataFrame

remove_blacklisted_names(df, preprocessed_col='preprocessed')

Parameters:

df (DataFrame)
preprocessed_col (str)

transform(X)

Combine scores of a group of name-pair candidates that belong together.

Match a group of company names that belong together to a company name in the ground truth.

Args: X: dataframe of scored candidates
Returns: dataframe of scored candidates, only one row per account

Parameters: X (DataFrame)
Return type: DataFrame | None