Parameters

When instantiating an EntityMatching object one can tune multiple parameters, in particular:

  • Which name column to use from the input data, and the id column of the GT,

  • The setting of the preprocessing pipeline (defaults should work okay),

  • Which indexers to use for the candidate selection, and with which settings,

  • Whether to turn the supervised layer on or off, and which input features to use (for example name-only features, or no rank features),

  • Whether to use name aggregation (turned off by default).

Below we go through the most important parameters to control the entity matching model.

Indexing

  • For the indexer parameters see the comments below.

  • It is important to set both name_col and entity_id_col as entity-matching parameters. The ground truth dataset needs both a name column and an entity-id column; a list of names to match needs only a name column.

from emm import PandasEntityMatching

# three example name-pair candidate generators:
# word-based and character-based cosine similarity, and sorted neighbouring indexing
indexers = [
    {
        "type": "cosine_similarity",
        "tokenizer": "words",        # word-based cosine similarity
        "ngram": 1,                  # 1-gram tokens only
        "num_candidates": 10,        # max 10 candidates per name-to-match
        "cos_sim_lower_bound": 0.,   # lower bound on cosine similarity
    },
    {
        "type": "cosine_similarity",
        "tokenizer": "characters",   # character-based cosine similarity
        "ngram": 2,                  # 2-gram character tokens only
        "num_candidates": 5,         # max 5 candidates per name-to-match
        "cos_sim_lower_bound": 0.2,  # lower bound on cosine similarity
    },
    {
        "type": "sni",
        "window_length": 3,          # sorted neighbouring indexing window of size 3.
    },
]
em_params = {
    "name_col": "Name",                     # important to set both index and name columns
    "entity_id_col": "Index",
    "indexers": indexers,
    "carry_on_cols": [],                    # names of columns in the GT and names-to-match dataframes passed on by the indexers. GT columns get prefix 'gt_'.
    "supervised_on": False,                 # no initial supervised model to select best candidates right now
    "name_only": True,                      # only consider name information for matching, e.g. not "country" info
    "without_rank_features": False,         # add rank-based features for improved probability of match
    "with_legal_entity_forms_match": True,  # add feature that indicates match of legal entity forms (eg. ltd != co)
    "aggregation_layer": False,
}
# initialize the entity matcher
p = PandasEntityMatching(em_params)
# prepare the indexers based on the ground truth names: e.g. fit the tfidf matrix of the first indexer.
p.fit(ground_truth)

# pandas dataframe with name-pair candidates, made by the indexers. all names have been preprocessed.
candidates_pd = p.transform(test_names)
candidates_pd.head()

In the candidates dataframe, the indexer output scores are called score_0, score_1, etc. by default, one per indexer in the order listed.
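
To make the roles of tokenizer, ngram, num_candidates and cos_sim_lower_bound concrete, here is a minimal standard-library sketch of cosine-similarity candidate selection. It is not the EMM implementation: it uses plain token counts instead of tf-idf weights, the helper names (tokenize, cosine, select_candidates) are hypothetical, and treating the lower bound as a strict "greater than" is an assumption.

```python
import math
from collections import Counter


def tokenize(name, tokenizer="words", ngram=1):
    """Split a name into word n-grams or character n-grams (lower-cased)."""
    units = name.lower().split() if tokenizer == "words" else list(name.lower())
    sep = " " if tokenizer == "words" else ""
    return [sep.join(units[i:i + ngram]) for i in range(len(units) - ngram + 1)]


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_candidates(name, gt_names, num_candidates=10, cos_sim_lower_bound=0.0,
                      tokenizer="words", ngram=1):
    """Return up to num_candidates (gt_name, score) pairs, best score first."""
    query = Counter(tokenize(name, tokenizer, ngram))
    scored = [(gt, cosine(query, Counter(tokenize(gt, tokenizer, ngram))))
              for gt in gt_names]
    # assumption: candidates must score strictly above the lower bound
    kept = [(gt, score) for gt, score in scored if score > cos_sim_lower_bound]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)[:num_candidates]


ground_truth_names = ["Apple Inc", "Apple Computer Inc", "Banana Corp"]
word_candidates = select_candidates("apple inc", ground_truth_names, num_candidates=2)
char_candidates = select_candidates("aple inc", ground_truth_names, num_candidates=5,
                                    cos_sim_lower_bound=0.2,
                                    tokenizer="characters", ngram=2)
```

The word-based call only finds names that share whole words, while the character 2-gram call still finds "Apple Inc" for the misspelled "aple inc". This is why combining several indexers can improve the coverage of the candidate selection.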

Supervised Layer

The classifier can be trained to give a string similarity score or a probability of match. Both types of score are useful, in particular when there are many good-looking matches to choose between.

  • With name_only=True the entity matcher only considers name information for matching. When set to False, it also considers country information, set with country_col.

  • The optional extra_features is a list of extra columns (and optionally a function to process them) shared between the GT and the names-to-match that are used for feature calculation (GT==ntm). See class PandasFeatureExtractor for more details, and also the carry_on_cols indexer option above. With name_only=False, internally extra_features=['country'].

  • The use of rank features can be turned off with the EMM parameter without_rank_features=True.

  • The use of legal entity form matching can be turned on with the EMM parameter with_legal_entity_forms_match=True.

  • The flag create_negative_sample_fraction=0.5 controls the fraction of positive names (those known to have a match) artificially converted into negative names (without a proper match).

  • The flag drop_duplicate_candidates=True drops any duplicate training candidates and keeps just one; if available, the correct match is kept. Recommended for string-similarity models, e.g. with without_rank_features=True. The default is False.
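
The effect of drop_duplicate_candidates can be pictured with a small standard-library sketch. It assumes that a "duplicate" means two training candidates with the same pair of name strings, and it uses hypothetical record keys (name, gt_name, gt_id, correct); the actual EMM logic and column names may differ.

```python
def drop_duplicate_candidates(candidates):
    """Keep one candidate per (name, gt_name) pair, preferring a correct match.

    `candidates` is a list of dicts with hypothetical keys 'name', 'gt_name'
    (name strings) and 'correct' (True if the pair is a true match).
    """
    best = {}
    for cand in candidates:
        key = (cand["name"], cand["gt_name"])
        # first occurrence wins, unless a later duplicate is the correct match
        if key not in best or (cand["correct"] and not best[key]["correct"]):
            best[key] = cand
    return list(best.values())


training_candidates = [
    {"name": "acme ltd", "gt_name": "acme limited", "gt_id": 1, "correct": False},
    {"name": "acme ltd", "gt_name": "acme limited", "gt_id": 2, "correct": True},
    {"name": "acme ltd", "gt_name": "acme corp", "gt_id": 3, "correct": False},
]
deduplicated = drop_duplicate_candidates(training_candidates)
```

For a model trained purely on string similarity, identical name pairs carry identical features, so keeping several copies with conflicting labels would only add noise to the training set.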

# create and fit a supervised model for the PandasEntityMatching object to pick the best match (this takes a while)
# input is the "positive" names column 'Name', holding names that are all supposed to match to the ground truth,
# and an id column 'Index' to check which candidate name-pairs are matching and which are not.
# A fraction of these names, here 0.50, can be artificially turned into negative names (no match to the ground truth).
# (internally candidate name-pairs are automatically generated, which are input for the classification)
# this call sets supervised_on=True.
p.fit_classifier(train_positive_names_to_match=train_names, create_negative_sample_fraction=0.5,
                 drop_duplicate_candidates=True, extra_features=None)

# generated name-pair candidates, now with classifier-based probability of match.
# Input is the names' column 'Name'. In the output candidates df, see extra column 'nm_score'.
candidates_scored_pd = p.transform(test_names)
candidates_scored_pd.head()
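
As an illustration of create_negative_sample_fraction in the fit_classifier call above, the sketch below converts a fraction of positive names into negative names by dropping their correct candidate, so that no proper match remains for them. This is an assumption about the mechanism, not the EMM code; the function name and the record keys (name_id, correct) are hypothetical.

```python
import random


def create_negative_names(candidates, fraction=0.5, seed=42):
    """Turn a fraction of positive names into negative names.

    `candidates` is a list of dicts with hypothetical keys 'name_id'
    (which name-to-match the pair belongs to) and 'correct' (bool).
    Returns the remaining candidates and the set of negative name ids.
    """
    rng = random.Random(seed)
    name_ids = sorted({cand["name_id"] for cand in candidates})
    n_negative = int(round(fraction * len(name_ids)))
    negative_ids = set(rng.sample(name_ids, n_negative))
    # a negative name keeps only its incorrect candidates
    kept = [cand for cand in candidates
            if not (cand["name_id"] in negative_ids and cand["correct"])]
    return kept, negative_ids


# four positive names, each with one correct and one incorrect candidate
example_candidates = [
    {"name_id": name_id, "gt_id": gt_id, "correct": gt_id == name_id}
    for name_id in range(4)
    for gt_id in (name_id, name_id + 100)
]
training_candidates, negative_ids = create_negative_names(example_candidates, fraction=0.5)
```

Training on a mix of positive and negative names lets the classifier learn that the best-looking candidate is not always a true match.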

In the candidates dataframe, the classification output score is called nm_score by default.

The trained sklearn model is accessible under p.supervised_models['nm_score'].

Instead of calling p.fit_classifier(), an independently trained sklearn model can be provided as well through p.add_supervised_model(skl_model).
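
Once candidates carry a classifier score, a typical next step is to keep the highest-scoring candidate per name. The sketch below shows this with plain Python records; the keys uid and gt_uid, the function name and the optional threshold are assumptions for illustration, not part of the documented EMM interface.

```python
def best_candidates(candidates, score_col="nm_score", threshold=0.0):
    """Return, per name uid, the candidate with the highest score.

    Candidates scoring below `threshold` are ignored, so a name without
    any sufficiently good candidate is treated as having no match.
    """
    best = {}
    for cand in candidates:
        uid = cand["uid"]
        score = cand[score_col]
        if score < threshold:
            continue
        if uid not in best or score > best[uid][score_col]:
            best[uid] = cand
    return best


scored_candidates = [
    {"uid": 0, "gt_uid": 10, "nm_score": 0.3},
    {"uid": 0, "gt_uid": 11, "nm_score": 0.9},
    {"uid": 1, "gt_uid": 10, "nm_score": 0.1},
]
matches = best_candidates(scored_candidates, threshold=0.5)
```

With a probability-of-match score, the threshold trades precision against recall: raising it leaves more names unmatched but makes the accepted matches more reliable.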

Aggregation Layer

Optionally, the EMM package can also be used to match a group of company names that belong together to a common company name in the ground truth, for example all the different names used to address an external bank account. This step aggregates the name-matching scores from the supervised layer into a single match.

The important parameters are:

  • account_col specifies which names belong together in one group. The default value is account.

  • freq_col specifies the weight of each name in a group, for example how often a name has been encountered.

  • score_col sets the score column to aggregate. By default this is the name-matching score nm_score, but it can also be a cosine similarity score such as score_0.

# add aggregation layer to the EMM object
# this sets aggregation_layer=True.
p.fit(gt)
p.add_aggregation_layer(
    score_col="nm_score",
    aggregation_method="max_frequency_nm_score",
    account_col="account",
    freq_col="counterparty_account_count_distinct",
)
candidates_pd = p.transform(account_data)
candidates_pd.head()

The aggregate output score is called agg_score by default.
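
To illustrate how account_col, freq_col and score_col interact, here is a standard-library sketch of one possible aggregation: a frequency-weighted mean score per account and ground-truth entity, followed by picking the best entity per account. This weighting is an assumption chosen for illustration and is not necessarily what max_frequency_nm_score computes; the record keys name and gt_entity_id and the shortened freq column are hypothetical. It assumes at most one row per name and entity, and positive frequencies.

```python
from collections import defaultdict


def aggregate_scores(rows, score_col="nm_score", account_col="account", freq_col="freq"):
    """Aggregate name-level scores into one best entity per account."""
    total_freq = defaultdict(float)  # account -> summed weight of its distinct names
    weighted = defaultdict(float)    # (account, entity) -> sum of freq * score
    seen_names = set()
    for row in rows:
        account = row[account_col]
        if (account, row["name"]) not in seen_names:
            seen_names.add((account, row["name"]))
            total_freq[account] += row[freq_col]
        weighted[(account, row["gt_entity_id"])] += row[freq_col] * row[score_col]
    best = {}
    for (account, entity), total in weighted.items():
        agg_score = total / total_freq[account]
        if account not in best or agg_score > best[account][1]:
            best[account] = (entity, agg_score)
    return best


account_rows = [
    {"account": "A", "name": "acme ltd", "freq": 3, "gt_entity_id": 1, "nm_score": 0.9},
    {"account": "A", "name": "acme ltd", "freq": 3, "gt_entity_id": 2, "nm_score": 0.2},
    {"account": "A", "name": "acme limited co", "freq": 1, "gt_entity_id": 1, "nm_score": 0.8},
    {"account": "A", "name": "acme limited co", "freq": 1, "gt_entity_id": 2, "nm_score": 0.4},
]
best_per_account = aggregate_scores(account_rows)
```

Weighting by frequency means that a name seen often for an account contributes more to the aggregated score than a name seen only once.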