emm.supervised_model package

Submodules

emm.supervised_model.base_supervised_model module

class emm.supervised_model.base_supervised_model.BaseSupervisedModel

Bases: Module

emm.supervised_model.base_supervised_model.calc_features_from_sm(sm, input, features_name='feat')
Parameters:
  • sm (Pipeline)

  • input (DataFrame)

emm.supervised_model.base_supervised_model.create_new_model_pipeline(name_only=True, feature_args=None, xgb_args=None)
Parameters:
  • name_only (bool)

  • feature_args (Optional[dict])

  • xgb_args (Optional[dict])

Return type:

Pipeline

emm.supervised_model.base_supervised_model.features_schema_from_sm(sm, return_spark_types=False)
Parameters:

sm (Pipeline)

emm.supervised_model.base_supervised_model.train_model(train_df, vocabulary=None, name_only=False, without_rank_features=False, positive_set_col='positive_set', custom_model=None, score_columns=None, with_legal_entity_forms_match=False, drop_features=None, n_jobs=-1, positive_only=False, extra_features=None, **feature_kws)

Train the supervised pipeline.

No testing is performed. The input dataset contains one row per candidate.

Args:
  • train_df: input name pairs to train on. See prepare_name_pairs().

  • vocabulary: vocabulary of common words. See create_vocabulary().

  • name_only: use name-only features. Default is False.

  • without_rank_features: if True, train without the generated rank features. Default is False.

  • positive_set_col: name of the positive_set column. Default is 'positive_set'.

  • custom_model: custom pipeline. Default is None.

  • score_columns: list of columns with raw scores from indexers to pass to the classifier. Default is None, meaning all indexer scores (e.g. cosine similarity values).

  • with_legal_entity_forms_match: if True, add a match of legal entity forms.

  • drop_features: list of features to drop at the end of feature calculation, before the supervised model (sm). Default is None.

  • n_jobs: number of parallel jobs passed on to the model. Default is -1.

  • positive_only: if True, train on positive names only and reject negative ones. Default is False.

  • extra_features: list of columns (and possibly functions) used for extra feature calculation, e.g. country if name_only=False. Default is None. With name_only=False, extra_features=['country'] is used internally.

  • feature_kws: extra kwargs passed on to the model init function.

Returns:

trained model
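
The score_columns default ("None, meaning all indexer scores") can be pictured with a small sketch. It assumes, hypothetically, that indexer scores sit in columns named score_0, score_1, and so on, as the benchmark_col default 'score_0' of train_test_model suggests. resolve_score_columns is an illustrative helper and not part of emm; the package may detect score columns differently.

```python
import re


def resolve_score_columns(columns, score_columns=None):
    """Return the score columns to hand to the classifier.

    Hypothetical helper: assumes indexer scores are stored in columns
    named score_<n>. This naming rule is an assumption, not emm's API.
    """
    if score_columns is not None:
        missing = [c for c in score_columns if c not in columns]
        if missing:
            raise KeyError(f"unknown score columns: {missing}")
        return list(score_columns)
    # None means: use every indexer score column that is present.
    return [c for c in columns if re.fullmatch(r"score_\d+", c)]


columns = ["uid", "gt_uid", "name", "gt_name", "score_0", "score_1", "positive_set"]
all_scores = resolve_score_columns(columns)
only_first = resolve_score_columns(columns, score_columns=["score_0"])
```

Passing an explicit list restricts the classifier input to those scores, while the default keeps every score column found in the input.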

emm.supervised_model.base_supervised_model.train_test_model(dataset, vocabulary=None, name_only=False, without_rank_features=False, n_folds=8, account_col='account', uid_col='uid', random_state=42, positive_set_col='positive_set', benchmark_col='score_0', custom_model=None, score_columns=None, with_legal_entity_forms_match=False, drop_features=None, n_jobs=-1, positive_only=False, extra_features=None)

Train and test the supervised pipeline.

The input dataset contains one row per candidate.

Args:
  • dataset: input name pairs to train on and validate. See prepare_name_pairs().

  • vocabulary: vocabulary of common words. See create_vocabulary().

  • name_only: use name-only features. Default is False.

  • without_rank_features: if True, train without the generated rank features. Default is False.

  • n_folds: number of folds. One fold is used for validation.

  • account_col: account column. Default is 'account'.

  • uid_col: uid column. Default is 'uid'.

  • random_state: random seed. Default is 42.

  • positive_set_col: name of the positive_set column. Default is 'positive_set'.

  • benchmark_col: score column used for benchmark validation. Default is 'score_0'.

  • custom_model: custom pipeline. Default is None.

  • score_columns: list of columns with raw scores from indexers to pass to the classifier. Default is None, meaning all indexer scores (e.g. cosine similarity values).

  • with_legal_entity_forms_match: if True, add a match of legal entity forms.

  • drop_features: list of features to drop at the end of feature calculation, before the supervised model (sm). Default is None.

  • n_jobs: number of parallel jobs passed on to the model. Default is -1.

  • positive_only: if True, train on positive names only and reject negative ones. Default is False.

  • extra_features: list of columns (and possibly functions) used for extra feature calculation, e.g. country if name_only=False. Default is None. With name_only=False, extra_features=['country'] is used internally.

Returns:

tuple of trained model and scored dataset.
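
To illustrate how n_folds, account_col and random_state could interact, here is a minimal sketch of a reproducible fold assignment with one fold held out for validation. It assumes that folds are formed per account, so that all candidates of an account land in the same fold; that grouping rule, the helper assign_folds and the toy rows are assumptions for illustration, not the package's confirmed behaviour.

```python
import random


def assign_folds(rows, n_folds=8, account_col="account", random_state=42):
    """Map each distinct account to a fold index in [0, n_folds)."""
    accounts = sorted({row[account_col] for row in rows})
    rng = random.Random(random_state)  # fixed seed gives a reproducible split
    rng.shuffle(accounts)
    # Round-robin over the shuffled accounts gives near-equal fold sizes.
    return {acc: i % n_folds for i, acc in enumerate(accounts)}


# Toy data: 10 accounts with 4 candidate rows each.
rows = [{"account": f"acc{i % 10}", "uid": i} for i in range(40)]
fold_of = assign_folds(rows, n_folds=8)

validation_fold = 0  # assumption: which fold is held out is arbitrary here
train = [r for r in rows if fold_of[r["account"]] != validation_fold]
valid = [r for r in rows if fold_of[r["account"]] == validation_fold]
```

Splitting on the account rather than on individual rows keeps candidates of the same account out of both sides of the split, which avoids leakage between training and validation.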

emm.supervised_model.pandas_supervised_model module

class emm.supervised_model.pandas_supervised_model.PandasSupervisedLayerTransformer(supervised_models, best_score_col='nm_score', return_features=False, *args, **kwargs)

Bases: TransformerMixin, BaseSupervisedModel

Pandas implementation of supervised model(s) transformer

Parameters:
  • supervised_models (Mapping[str, dict])

  • best_score_col (str | None)

  • return_features (bool)

  • args (Any)

  • kwargs (Any)

calc_features(X)

Calculate the name-pair features.

Append calculated features to the input dataframe

Parameters:

X (DataFrame)

Return type:

DataFrame

calc_score(X)

Calculate the score using the supervised model.

The supervised model is run separately for each uid group.

Parameters:

X (DataFrame)

Return type:

DataFrame
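
The per-uid scoring described for calc_score can be sketched in plain Python (used here instead of pandas to keep the example dependency-free). The helper score_per_uid, the stand-in toy_model and the column names in the toy data are assumptions for illustration only; the real method operates on a DataFrame and a trained pipeline.

```python
from collections import defaultdict


def score_per_uid(rows, score_group, uid_col="uid", score_col="nm_score"):
    """Apply score_group to the candidates of each uid separately."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[uid_col]].append(row)
    scored = []
    for group in groups.values():
        # score_group sees only the candidates that share one uid.
        for row, score in zip(group, score_group(group)):
            scored.append({**row, score_col: score})
    return scored


def toy_model(group):
    """Stand-in for a trained model: score relative to the group's best."""
    best = max(r["score_0"] for r in group)
    return [r["score_0"] / best for r in group]


candidates = [
    {"uid": 1, "gt_uid": 10, "score_0": 0.8},
    {"uid": 1, "gt_uid": 11, "score_0": 0.4},
    {"uid": 2, "gt_uid": 10, "score_0": 0.5},
]
scored = score_per_uid(candidates, toy_model)
```

Scoring one group at a time matters whenever a feature depends on the other candidates of the same name-to-match, as rank-style features do.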

fit(X, y=None)

Fit the CalcFeatures step of an untrained supervised model.

When an untrained supervised model has been provided, calling fit() updates the vocabularies of the CalcFeatures module, if that is present in the pipeline under key ‘feat’.

To update the vocabularies, provide a list of processed ground truth names.

When this has been done, and return_features=True, then calling transform() returns the features calculated by CalcFeatures.

Args:
  • X: processed ground-truth names.

  • y: ignored.

Returns:

self

Parameters:
  • X (DataFrame)

  • y (Optional[Series])

Return type:

PandasSupervisedLayerTransformer

fit_transform(X, y=None)

Placeholder for fit_transform.

This avoids an unnecessary transform of the ground truth (gt) during SklearnPipeline.fit_transform(gt).

(The sklearn Pipeline calls fit_transform on every stage except the last one, and with a supervised model the CandidateSelection stage is an intermediate step.)

Args:
  • X: input dataframe for fitting.

  • y: ignored.

Parameters:
  • X (DataFrame)

  • y (Optional[Series])

Return type:

None
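
The effect of a placeholder fit_transform can be shown with a toy stage and a simplified fit loop. This is a sketch under stated assumptions: fit_all only imitates the "fit_transform on every stage but the last" behaviour and is not sklearn's Pipeline, and ScoringStage is a hypothetical stand-in for the transformer.

```python
class ScoringStage:
    """Toy stage whose fit_transform fits but skips the transform."""

    def __init__(self):
        self.fitted = False
        self.transform_calls = 0

    def fit(self, X, y=None):
        self.fitted = True
        return self

    def transform(self, X):
        self.transform_calls += 1
        return [x * 2 for x in X]

    def fit_transform(self, X, y=None):
        # Placeholder: fit only, and skip the transform of the ground truth.
        self.fit(X, y)
        return None


def fit_all(stages, X):
    """Simplified stand-in for a pipeline fit.

    Intermediate stages get fit_transform, the last stage gets fit.
    """
    for stage in stages[:-1]:
        X = stage.fit_transform(X)
    stages[-1].fit(X)
    return stages


stages = fit_all([ScoringStage(), ScoringStage()], [1, 2, 3])
doubled = stages[0].transform([1, 2, 3])  # an explicit transform still works
```

Both stages end up fitted, yet no transform runs during fitting; the transform cost is paid only when transform() is called on the names to match.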

select_best_score(X, group_cols, best_score_col='nm_score', sort_cols=None, sort_asc=None, best_match_col='best_match', best_rank_col='best_rank', gt_uid_col='gt_uid')

Select final best score from supervised model (before penalty calculation).

Returned dataframe will be sorted by group_cols + sort_cols to make it easier to calculate penalty.

Args:
  • X: pandas DataFrame with scores from the supervised model.

  • group_cols: column name or list of column names used in aggregation.

  • best_score_col: sort these scores in descending order. Default is 'nm_score'.

  • sort_cols: (optional) list of columns used in ordering the results.

  • sort_asc: (optional) list of booleans to determine ascending order of sort_cols.

  • best_match_col: column indicating the best match of all name-matching scores. Default is 'best_match'.

  • best_rank_col: column with the rank of the sorted scores. Default is 'best_rank'.

  • gt_uid_col: name of the column holding the ground truth (gt) uid. Default is 'gt_uid'.

Parameters:
  • X (DataFrame)

  • group_cols (list[str])

  • best_score_col (str | None)

  • sort_cols (Optional[list[str]])

  • sort_asc (Optional[list[bool]])

  • best_match_col (str)

  • best_rank_col (str)

  • gt_uid_col (str | None)

Return type:

DataFrame
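
The selection described for select_best_score (sort scores in descending order within each group, rank them, flag the best one) can be sketched in plain Python. The sketch assumes that ranks start at 1, that ties keep their input order and that the best-match flag is a boolean; it ignores sort_cols, sort_asc and gt_uid_col. select_best_score_sketch is a hypothetical helper, not the method itself.

```python
from itertools import groupby
from operator import itemgetter


def select_best_score_sketch(rows, group_cols, best_score_col="nm_score",
                             best_match_col="best_match",
                             best_rank_col="best_rank"):
    """Rank candidates per group by descending score and flag the best."""
    key = itemgetter(*group_cols)
    # Stable two-pass sort: score descending first, then by group.
    ordered = sorted(rows, key=itemgetter(best_score_col), reverse=True)
    ordered = sorted(ordered, key=key)
    out = []
    for _, group in groupby(ordered, key=key):
        for rank, row in enumerate(group, start=1):
            out.append({**row, best_rank_col: rank, best_match_col: rank == 1})
    return out


rows = [
    {"uid": 1, "gt_uid": 10, "nm_score": 0.2},
    {"uid": 1, "gt_uid": 11, "nm_score": 0.9},
    {"uid": 2, "gt_uid": 12, "nm_score": 0.6},
]
best = select_best_score_sketch(rows, ["uid"])
```

The output is ordered by group and then by rank, which mirrors the note above that the returned dataframe is sorted to make the later penalty calculation easier.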

transform(X)

Supervised layer transformation for name matching of name-pair candidates.

PandasSupervisedLayerTransformer scores each candidate name pair and, based on that score, picks the best ground truth name for each name-to-match.

When return_features=True calling transform() also returns the features calculated by CalcFeatures.

Args:

X: input name-pair candidates for scoring.

Returns:

candidates dataframe including the name-matching scoring column nm_score.

Parameters:

X (DataFrame)

Return type:

DataFrame | None

emm.supervised_model.spark_supervised_model module

Module contents

class emm.supervised_model.PandasSupervisedLayerTransformer(supervised_models, best_score_col='nm_score', return_features=False, *args, **kwargs)

Bases: TransformerMixin, BaseSupervisedModel

Pandas implementation of supervised model(s) transformer

Parameters:
  • supervised_models (Mapping[str, dict])

  • best_score_col (str | None)

  • return_features (bool)

  • args (Any)

  • kwargs (Any)

calc_features(X)

Calculate the name-pair features.

Append calculated features to the input dataframe

Parameters:

X (DataFrame)

Return type:

DataFrame

calc_score(X)

Calculate the score using the supervised model.

The supervised model is run separately for each uid group.

Parameters:

X (DataFrame)

Return type:

DataFrame

fit(X, y=None)

Fit the CalcFeatures step of an untrained supervised model.

When an untrained supervised model has been provided, calling fit() updates the vocabularies of the CalcFeatures module, if that is present in the pipeline under key ‘feat’.

To update the vocabularies, provide a list of processed ground truth names.

When this has been done, and return_features=True, then calling transform() returns the features calculated by CalcFeatures.

Args:
  • X: processed ground-truth names.

  • y: ignored.

Returns:

self

Parameters:
  • X (DataFrame)

  • y (Optional[Series])

Return type:

PandasSupervisedLayerTransformer

fit_transform(X, y=None)

Placeholder for fit_transform.

This avoids an unnecessary transform of the ground truth (gt) during SklearnPipeline.fit_transform(gt).

(The sklearn Pipeline calls fit_transform on every stage except the last one, and with a supervised model the CandidateSelection stage is an intermediate step.)

Args:
  • X: input dataframe for fitting.

  • y: ignored.

Parameters:
  • X (DataFrame)

  • y (Optional[Series])

Return type:

None

select_best_score(X, group_cols, best_score_col='nm_score', sort_cols=None, sort_asc=None, best_match_col='best_match', best_rank_col='best_rank', gt_uid_col='gt_uid')

Select final best score from supervised model (before penalty calculation).

Returned dataframe will be sorted by group_cols + sort_cols to make it easier to calculate penalty.

Args:
  • X: pandas DataFrame with scores from the supervised model.

  • group_cols: column name or list of column names used in aggregation.

  • best_score_col: sort these scores in descending order. Default is 'nm_score'.

  • sort_cols: (optional) list of columns used in ordering the results.

  • sort_asc: (optional) list of booleans to determine ascending order of sort_cols.

  • best_match_col: column indicating the best match of all name-matching scores. Default is 'best_match'.

  • best_rank_col: column with the rank of the sorted scores. Default is 'best_rank'.

  • gt_uid_col: name of the column holding the ground truth (gt) uid. Default is 'gt_uid'.

Parameters:
  • X (DataFrame)

  • group_cols (list[str])

  • best_score_col (str | None)

  • sort_cols (Optional[list[str]])

  • sort_asc (Optional[list[bool]])

  • best_match_col (str)

  • best_rank_col (str)

  • gt_uid_col (str | None)

Return type:

DataFrame

transform(X)

Supervised layer transformation for name matching of name-pair candidates.

PandasSupervisedLayerTransformer scores each candidate name pair and, based on that score, picks the best ground truth name for each name-to-match.

When return_features=True calling transform() also returns the features calculated by CalcFeatures.

Args:

X: input name-pair candidates for scoring.

Returns:

candidates dataframe including the name-matching scoring column nm_score.

Parameters:

X (DataFrame)

Return type:

DataFrame | None

emm.supervised_model.train_model(train_df, vocabulary=None, name_only=False, without_rank_features=False, positive_set_col='positive_set', custom_model=None, score_columns=None, with_legal_entity_forms_match=False, drop_features=None, n_jobs=-1, positive_only=False, extra_features=None, **feature_kws)

Train the supervised pipeline.

No testing is performed. The input dataset contains one row per candidate.

Args:
  • train_df: input name pairs to train on. See prepare_name_pairs().

  • vocabulary: vocabulary of common words. See create_vocabulary().

  • name_only: use name-only features. Default is False.

  • without_rank_features: if True, train without the generated rank features. Default is False.

  • positive_set_col: name of the positive_set column. Default is 'positive_set'.

  • custom_model: custom pipeline. Default is None.

  • score_columns: list of columns with raw scores from indexers to pass to the classifier. Default is None, meaning all indexer scores (e.g. cosine similarity values).

  • with_legal_entity_forms_match: if True, add a match of legal entity forms.

  • drop_features: list of features to drop at the end of feature calculation, before the supervised model (sm). Default is None.

  • n_jobs: number of parallel jobs passed on to the model. Default is -1.

  • positive_only: if True, train on positive names only and reject negative ones. Default is False.

  • extra_features: list of columns (and possibly functions) used for extra feature calculation, e.g. country if name_only=False. Default is None. With name_only=False, extra_features=['country'] is used internally.

  • feature_kws: extra kwargs passed on to the model init function.

Returns:

trained model

emm.supervised_model.train_test_model(dataset, vocabulary=None, name_only=False, without_rank_features=False, n_folds=8, account_col='account', uid_col='uid', random_state=42, positive_set_col='positive_set', benchmark_col='score_0', custom_model=None, score_columns=None, with_legal_entity_forms_match=False, drop_features=None, n_jobs=-1, positive_only=False, extra_features=None)

Train and test the supervised pipeline.

The input dataset contains one row per candidate.

Args:
  • dataset: input name pairs to train on and validate. See prepare_name_pairs().

  • vocabulary: vocabulary of common words. See create_vocabulary().

  • name_only: use name-only features. Default is False.

  • without_rank_features: if True, train without the generated rank features. Default is False.

  • n_folds: number of folds. One fold is used for validation.

  • account_col: account column. Default is 'account'.

  • uid_col: uid column. Default is 'uid'.

  • random_state: random seed. Default is 42.

  • positive_set_col: name of the positive_set column. Default is 'positive_set'.

  • benchmark_col: score column used for benchmark validation. Default is 'score_0'.

  • custom_model: custom pipeline. Default is None.

  • score_columns: list of columns with raw scores from indexers to pass to the classifier. Default is None, meaning all indexer scores (e.g. cosine similarity values).

  • with_legal_entity_forms_match: if True, add a match of legal entity forms.

  • drop_features: list of features to drop at the end of feature calculation, before the supervised model (sm). Default is None.

  • n_jobs: number of parallel jobs passed on to the model. Default is -1.

  • positive_only: if True, train on positive names only and reject negative ones. Default is False.

  • extra_features: list of columns (and possibly functions) used for extra feature calculation, e.g. country if name_only=False. Default is None. With name_only=False, extra_features=['country'] is used internally.

Returns:

tuple of trained model and scored dataset.