emm.aggregation package

Submodules

emm.aggregation.base_entity_aggregation module

class emm.aggregation.base_entity_aggregation.BaseEntityAggregation(score_col, account_col='account', index_col='entity_id', gt_entity_id_col='gt_entity_id', uid_col='uid', gt_uid_col='gt_uid', name_col='name_col', freq_col='counterparty_account_count_distinct', output_col='agg_score', preprocessed_col='preprocessed', gt_name_col='gt_name', gt_preprocessed_col='gt_preprocessed', correct_col='correct', aggregation_method='max_frequency_nm_score', blacklist=None, positive_set_col='positive_set')

Bases: Pipeline

Parameters:
  • score_col (str)

  • account_col (str)

  • index_col (str)

  • gt_entity_id_col (str)

  • uid_col (str)

  • gt_uid_col (str)

  • name_col (str)

  • freq_col (str)

  • output_col (str)

  • preprocessed_col (str)

  • gt_name_col (str)

  • gt_preprocessed_col (str)

  • correct_col (str)

  • aggregation_method (Literal['max_frequency_nm_score', 'mean_score'])

  • blacklist (Optional[list])

  • positive_set_col (str)

get_group(dataframe)
Return type:

list[str]

get_gt_group()
Return type:

list[str]

abstract remove_blacklisted_names(df, preprocessed_col)
Parameters:
  • df (Any)

  • preprocessed_col (str)

Return type:

Any

emm.aggregation.base_entity_aggregation.is_series_unique(series)
Parameters:

series (Series)

Return type:

bool

emm.aggregation.base_entity_aggregation.matching_max_candidate(df, group, score_col, name_col, account_col, freq_col, output_col, aggregation_method='max_frequency_nm_score')

Aggregates all the names of an account together with their candidates. If aggregation_method = 'mean_score', the scores are averaged per ground-truth (GT) entry and the maximum of those averages is returned.

Returns dataframe with a single row.

Args:

df: Pandas DataFrame containing all the names of an account
group: Grouping columns used for calculating agg_score, usually or (gt_entity_id, gt_uid)
score_col: Score column on which the aggregation is performed
name_col: name column used for name clustering
account_col: account column used for name clustering
freq_col: Frequency column used for the name clustering and weighted averages
output_col: Name of column to store the final score
aggregation_method: Aggregation method to use: name_clustering, mean_score, or max_frequency_nm_score

Parameters:
  • df (DataFrame)

  • group (list[str])

  • score_col (str)

  • name_col (str)

  • account_col (str)

  • freq_col (str)

  • output_col (str)

  • aggregation_method (Literal['max_frequency_nm_score', 'mean_score'])

Return type:

DataFrame
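The 'mean_score' behaviour described above (average the scores per ground-truth entry, then return the maximum as a single row) can be sketched in plain pandas. This is a minimal illustration under stated assumptions, not the emm implementation: the helper name mean_score_sketch and the example column nm_score are hypothetical, and the frequency-weighted 'max_frequency_nm_score' method is not covered.

```python
import pandas as pd


def mean_score_sketch(df, group, score_col, output_col="agg_score"):
    """Hypothetical helper: average scores per ground-truth group, keep the best."""
    # One averaged score per ground-truth candidate.
    means = df.groupby(group, as_index=False)[score_col].mean()
    # Keep only the candidate with the highest average: a single-row frame.
    best = means.nlargest(1, score_col).rename(columns={score_col: output_col})
    return best.reset_index(drop=True)


# All names of one account, each scored against two ground-truth entities.
candidates = pd.DataFrame(
    {
        "account": ["A1", "A1", "A1", "A1"],
        "name": ["Acme Ltd", "Acme Limited", "Acme Ltd", "Acme Limited"],
        "gt_entity_id": [10, 10, 20, 20],
        "nm_score": [0.9, 0.7, 0.6, 0.2],
    }
)

result = mean_score_sketch(candidates, group=["gt_entity_id"], score_col="nm_score")
print(result)
```

In this example the averages are roughly 0.8 for gt_entity_id 10 and 0.4 for gt_entity_id 20, so the single returned row refers to entity 10.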

emm.aggregation.pandas_entity_aggregation module

class emm.aggregation.pandas_entity_aggregation.PandasEntityAggregation(score_col, account_col='account', index_col='entity_id', gt_entity_id_col='gt_entity_id', uid_col='uid', gt_uid_col='gt_uid', name_col='name', freq_col='counterparty_account_count_distinct', output_col='agg_score', preprocessed_col='preprocessed', gt_name_col='gt_name', gt_preprocessed_col='gt_preprocessed', correct_col='correct', aggregation_method='max_frequency_nm_score', blacklist=None)

Bases: TransformerMixin, BaseEntityAggregation

Pandas name-matching aggregation code

Parameters:
  • score_col (str)

  • account_col (str)

  • index_col (str)

  • gt_entity_id_col (str)

  • uid_col (str)

  • gt_uid_col (str)

  • name_col (str)

  • freq_col (str)

  • output_col (str)

  • preprocessed_col (str)

  • gt_name_col (str)

  • gt_preprocessed_col (str)

  • correct_col (str)

  • aggregation_method (Literal['max_frequency_nm_score', 'mean_score'])

  • blacklist (Optional[list[str]])

fit(X, y=None)

Dummy function, no fitting is required.

Parameters:
  • X (DataFrame)

  • y (Optional[Series])

Return type:

TransformerMixin

fit_transform(X, y=None)

Only calls transform(); no fitting is required.

Parameters:
  • X (DataFrame)

  • y (Optional[Series])

Return type:

DataFrame
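The fit and fit_transform entries above describe a stateless transformer: fit does nothing and fit_transform simply delegates to transform. A minimal sketch of that pattern follows. The class StatelessAggregatorSketch, its placeholder aggregation (highest score per account), and the column names are assumptions for illustration only, not the emm code.

```python
import pandas as pd


class StatelessAggregatorSketch:
    """Hypothetical stand-in for a transformer that needs no fitting."""

    def fit(self, X, y=None):
        # Nothing to learn; return self so calls can be chained.
        return self

    def transform(self, X):
        # Placeholder aggregation: keep the highest-scoring row per account.
        idx = X.groupby("account")["nm_score"].idxmax()
        return X.loc[idx].reset_index(drop=True)

    def fit_transform(self, X, y=None):
        # No fitting step exists, so this only delegates to transform().
        return self.transform(X)


scored = pd.DataFrame(
    {"account": ["A1", "A1", "B2"], "nm_score": [0.4, 0.9, 0.5]}
)
agg = StatelessAggregatorSketch()
out = agg.fit_transform(scored)
print(out)
```

Because no state is learned, fit(X).transform(X) and fit_transform(X) give the same result, and the output has one row per account.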

remove_blacklisted_names(df, preprocessed_col='preprocessed')
Parameters:
  • df (DataFrame)

  • preprocessed_col (str)
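The reference gives no description for remove_blacklisted_names beyond its signature. One plausible reading, sketched below purely as an assumption, is that rows whose preprocessed name appears exactly in the blacklist are dropped before aggregation. The function drop_blacklisted_sketch and the exact-match rule are hypothetical; the actual method may match names differently.

```python
import pandas as pd


def drop_blacklisted_sketch(df, blacklist, preprocessed_col="preprocessed"):
    """Hypothetical helper: drop rows whose preprocessed name is blacklisted."""
    # With no blacklist configured there is nothing to filter.
    if not blacklist:
        return df
    # Assumption: exact match on the preprocessed name.
    keep = ~df[preprocessed_col].isin(blacklist)
    return df[keep].reset_index(drop=True)


names = pd.DataFrame(
    {
        "preprocessed": ["acme ltd", "unknown", "globex bv"],
        "nm_score": [0.9, 0.8, 0.7],
    }
)
kept = drop_blacklisted_sketch(names, blacklist=["unknown"])
print(kept)
```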

transform(X)

Combine scores of a group of name-pair candidates that belong together.

Match a group of company names that belong together to a company name in the ground truth.

Args:

X: dataframe of scored candidates

Returns:

dataframe of scored candidates, only one row per account

Parameters:

X (DataFrame)

Return type:

DataFrame | None

emm.aggregation.spark_entity_aggregation module

Module contents

class emm.aggregation.PandasEntityAggregation(score_col, account_col='account', index_col='entity_id', gt_entity_id_col='gt_entity_id', uid_col='uid', gt_uid_col='gt_uid', name_col='name', freq_col='counterparty_account_count_distinct', output_col='agg_score', preprocessed_col='preprocessed', gt_name_col='gt_name', gt_preprocessed_col='gt_preprocessed', correct_col='correct', aggregation_method='max_frequency_nm_score', blacklist=None)

Bases: TransformerMixin, BaseEntityAggregation

Pandas name-matching aggregation code

Parameters:
  • score_col (str)

  • account_col (str)

  • index_col (str)

  • gt_entity_id_col (str)

  • uid_col (str)

  • gt_uid_col (str)

  • name_col (str)

  • freq_col (str)

  • output_col (str)

  • preprocessed_col (str)

  • gt_name_col (str)

  • gt_preprocessed_col (str)

  • correct_col (str)

  • aggregation_method (Literal['max_frequency_nm_score', 'mean_score'])

  • blacklist (Optional[list[str]])

fit(X, y=None)

Dummy function, no fitting is required.

Parameters:
  • X (DataFrame)

  • y (Optional[Series])

Return type:

TransformerMixin

fit_transform(X, y=None)

Only calls transform(); no fitting is required.

Parameters:
  • X (DataFrame)

  • y (Optional[Series])

Return type:

DataFrame

remove_blacklisted_names(df, preprocessed_col='preprocessed')
Parameters:
  • df (DataFrame)

  • preprocessed_col (str)

transform(X)

Combine scores of a group of name-pair candidates that belong together.

Match a group of company names that belong together to a company name in the ground truth.

Args:

X: dataframe of scored candidates

Returns:

dataframe of scored candidates, only one row per account

Parameters:

X (DataFrame)

Return type:

DataFrame | None