emm.indexing package

Submodules

emm.indexing.base_indexer module

class emm.indexing.base_indexer.BaseIndexer

Bases: Module

Base implementation of Indexer class

decrease_window_by_one_step()

Utility function for negative sample creation during training

This should change the parameter settings of the fitted model.

increase_window_by_one_step()

Utility function for negative sample creation during training

This should change the parameter settings of the fitted model.

static version()
class emm.indexing.base_indexer.CosSimBaseIndexer(num_candidates)

Bases: BaseIndexer

Base implementation of CosSimIndexer class

Parameters:

num_candidates (int)

decrease_window_by_one_step()

Utility function for negative sample creation during training

This changes the parameter settings of the fitted model.

Return type:

None

increase_window_by_one_step()

Utility function for negative sample creation during training

This changes the parameter settings of the fitted model.

Return type:

None

class emm.indexing.base_indexer.SNBaseIndexer(window_length)

Bases: BaseIndexer

Base implementation of SN Indexer class

Parameters:

window_length (int)

decrease_window_by_one_step()

Utility function for negative sample creation during training

This changes the parameter settings of the fitted model.

Return type:

None

increase_window_by_one_step()

Utility function for negative sample creation during training

This changes the parameter settings of the fitted model.

Return type:

None
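
A minimal sketch of how these window hooks could be used when building negative training samples. It assumes a fitted concrete indexer (for example a cosine-similarity indexer with num_candidates, or an SNI indexer with window_length) whose step methods widen or narrow the candidate window; the helper name is hypothetical:

    # Hypothetical helper: temporarily widen the candidate window of a fitted
    # indexer, collect the extra (mostly negative) candidate pairs, then restore it.
    def collect_extra_candidates(indexer, names_df):
        indexer.increase_window_by_one_step()   # widen the window (assumed: num_candidates / window_length)
        extra_pairs = indexer.transform(names_df)
        indexer.decrease_window_by_one_step()   # restore the original setting
        return extra_pairs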

emm.indexing.pandas_candidate_selection module

class emm.indexing.pandas_candidate_selection.PandasCandidateSelectionTransformer(indexers, uid_col=None, carry_on_cols=None, with_no_matches=True)

Bases: TransformerMixin

Pandas middleware class that aggregates candidate pairs for possible matches.

Parameters:
  • indexers (list[BaseIndexer])

  • uid_col (Optional[str])

  • carry_on_cols (Optional[list[str]])

  • with_no_matches (bool | None)

decrease_window_by_one_step()

Utility function for negative sample creation during training

Return type:

None

fit(X, y=None)

Fit the indexers to ground truth names

For example this creates TFIDF matrices for the cosine similarity indexers.

Args:

X: ground truth dataframe with preprocessed names.
y: ignored.

Returns:

self

Parameters:
  • X (DataFrame)

  • y (Optional[Series])

Return type:

TransformerMixin

fit_transform(X, y=None)

Tailored placeholder for fit_transform

Only calls fit(gt); this avoids an unnecessary transform of the ground truth during SklearnPipeline.fit_transform(gt).

The sklearn Pipeline calls fit_transform on all stages except the last one, and with a supervised model the candidate selection stage is an intermediate step.

Args:

X: Pandas dataframe with names that are used to fit the indexers.
y: ignored.

Returns:

Pandas dataframe with processed ground truth names.

Parameters:
  • X (DataFrame)

  • y (Optional[Series])

Return type:

DataFrame

increase_window_by_one_step()

Utility function for negative sample creation during training

Return type:

None

transform(X)

Match the names in the X dataset against the previously fitted ground truth.

Args:

X: Pandas dataframe with preprocessed names that should be matched

Returns:
Pandas dataframe with the candidate matches returned by the indexers. Each row contains a single pair of candidates.

The columns gt_uid and uid contain the index values from the ground truth and X. Optionally, an id column (specified by self.uid_col) and carry-on columns (specified by self.carry_on_cols) are copied from the gt/X dataframes, with the prefix gt_ for ground-truth columns and no prefix for X columns. Any additional columns calculated by the indexers are also preserved (e.g. score).

Parameters:

X (DataFrame)

Return type:

DataFrame
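
A minimal usage sketch of the candidate selection step, assuming small example dataframes gt and names with a preprocessed-name column; only signatures documented on this page are used:

    import pandas as pd

    from emm.indexing.pandas_candidate_selection import PandasCandidateSelectionTransformer
    from emm.indexing.pandas_cos_sim_matcher import PandasCosSimIndexer
    from emm.indexing.pandas_sni import PandasSortedNeighbourhoodIndexer

    gt = pd.DataFrame({"preprocessed": ["apple inc", "alpha bank", "acme corp"]})
    names = pd.DataFrame({"preprocessed": ["aple inc", "acme corporation"]})

    indexers = [
        PandasCosSimIndexer(input_col="preprocessed", num_candidates=3),
        PandasSortedNeighbourhoodIndexer(input_col="preprocessed", window_length=3),
    ]
    selector = PandasCandidateSelectionTransformer(indexers=indexers)

    selector.fit(gt)                        # fits every indexer on the ground truth
    candidates = selector.transform(names)  # one row per candidate pair (uid, gt_uid, indexer scores, ...)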

emm.indexing.pandas_candidate_selection.select_with_prefix(df, cols, prefix)
Parameters:
  • df (DataFrame)

  • cols (list[str])

  • prefix (str)

Return type:

DataFrame

emm.indexing.pandas_cos_sim_matcher module

class emm.indexing.pandas_cos_sim_matcher.PandasCosSimIndexer(input_col='preprocessed', tokenizer='words', ngram=1, binary_countvectorizer=False, num_candidates=5, cos_sim_lower_bound=0.5, partition_size=5000, max_features=None, n_jobs=1, spark_session=None, blocking_func=None, dtype=<class 'numpy.float32'>, indexer_id=None)

Bases: TransformerMixin, CosSimBaseIndexer

Cosine similarity indexer to generate candidate name-pairs of possible matches

Parameters:
  • input_col (str)

  • tokenizer (Literal['words', 'characters'])

  • ngram (int)

  • binary_countvectorizer (bool)

  • num_candidates (int)

  • cos_sim_lower_bound (float)

  • partition_size (int)

  • max_features (Optional[int])

  • n_jobs (int)

  • spark_session (Optional[Any])

  • blocking_func (Union[Callable[[str], str], str, None])

  • dtype (type[float])

  • indexer_id (Optional[int])

calc_score(name1, name2)
Parameters:
  • name1 (Series)

  • name2 (Series)

Return type:

DataFrame

column_prefix()
Return type:

str

fit(X, y=None)

Fit the cosine similarity indexers to ground truth names

This creates TFIDF weights and matrix based on the ground truth names.

Args:

X: ground truth dataframe with preprocessed names.
y: ignored.

Returns:

self

Parameters:
  • X (DataFrame)

  • y (Optional[Series])

Return type:

TransformerMixin

transform(X, multiple_indexers=None)

Match the names in the X dataset against the previously fitted ground truth.

Args:

X: Pandas dataframe with preprocessed names that should be matched.
multiple_indexers: ignored.

Returns:

Pandas dataframe with the candidate matches returned by the indexer. Each row contains a single pair of candidates. The columns gt_uid and uid contain the index values from the ground truth and X. Optionally, an id column (specified by self.uid_col) and carry-on columns (specified by self.carry_on_cols) are copied from the gt/X dataframes, with the prefix gt_ for ground-truth columns and no prefix for X columns. Any additional columns calculated by the indexer are also preserved (e.g. score).

Parameters:
  • X (DataFrame)

  • multiple_indexers (Optional[bool])

Return type:

DataFrame
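
A short standalone sketch of the cosine-similarity indexer, using only the constructor arguments shown in the signature above (the example data is illustrative):

    import pandas as pd

    from emm.indexing.pandas_cos_sim_matcher import PandasCosSimIndexer

    gt = pd.DataFrame({"preprocessed": ["stark industries", "wayne enterprises"]})
    names = pd.DataFrame({"preprocessed": ["stark industry"]})

    indexer = PandasCosSimIndexer(
        input_col="preprocessed",
        tokenizer="words",
        num_candidates=5,
        cos_sim_lower_bound=0.2,
    )
    indexer.fit(gt)                   # builds the TFIDF weights and matrix from the ground truth
    pairs = indexer.transform(names)  # candidate pairs with a cosine-similarity score column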

emm.indexing.pandas_naive_indexer module

emm.indexing.pandas_naive_indexer.NaiveIndexer

alias of PandasNaiveIndexer

class emm.indexing.pandas_naive_indexer.PandasNaiveIndexer(indexer_id=None)

Bases: TransformerMixin, BaseIndexer

Naive O(n^2) indexer for small datasets. Not for production use.

Parameters:

indexer_id (Optional[int])

fit(X, y=None)

Dummy function, no fitting required.

Parameters:
  • X (DataFrame)

  • y (Optional[Series])

Return type:

TransformerMixin

transform(X, spark_session=None, multiple_indexers=False)

Create all possible name-pairs

Args:

X: dataframe with (processed) input names to match to the ground truth.
spark_session: ignored.
multiple_indexers: ignored.

Returns:

dataframe with all possible candidate name pairs.

Parameters:
  • X (DataFrame)

  • spark_session (Optional[Any])

  • multiple_indexers (bool)

Return type:

DataFrame
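
A hedged sketch for tiny test datasets; the naive indexer simply forms every possible pair, so it should not be used on large inputs:

    import pandas as pd

    from emm.indexing.pandas_naive_indexer import PandasNaiveIndexer

    gt = pd.DataFrame({"preprocessed": ["abc ltd", "def gmbh"]})
    names = pd.DataFrame({"preprocessed": ["abc limited"]})

    indexer = PandasNaiveIndexer()
    indexer.fit(gt)                   # no-op, kept for API compatibility
    pairs = indexer.transform(names)  # cross product: every (name, ground-truth) pair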

emm.indexing.pandas_normalized_tfidf module

Customized TFIDF vectorization.

class emm.indexing.pandas_normalized_tfidf.PandasNormalizedTfidfVectorizer(**kwargs)

Bases: TfidfVectorizer

Implementation of customized TFIDF vectorizer

Parameters:

kwargs (Any)

fit(X)

Fit the TFIDF vectorizer.

Args:

X: dataframe with preprocessed names

Returns:

self

Parameters:

X (Series | DataFrame)

Return type:

TfidfVectorizer

fit_transform(raw_documents, y=None)

Implementation of fit followed by transform

Args:

raw_documents: dataframe with preprocessed input names.
y: ignored.

Returns:

normalized tfidf vectors of names.

Parameters:
  • raw_documents (Series | DataFrame)

  • y (Optional[Any])

Return type:

csr_matrix

transform(X)

Apply the fitted TFIDF vectorizer

Args:

X: dataframe with preprocessed names

Returns:

normalized tfidf vectors of names

Parameters:

X (Series | DataFrame)

Return type:

csr_matrix

transform_parallel(X, n_jobs=-1)

Apply the fitted TFIDF vectorizer in parallel

Inspired by: https://github.com/scikit-learn/scikit-learn/issues/7635#issuecomment-254407618

Args:

X: dataframe with preprocessed names.
n_jobs: desired number of parallel jobs. Default is all available cores.

Returns:

normalized tfidf vectors of names

Parameters:
  • X (Series | DataFrame)

  • n_jobs (int)

Return type:

csr_matrix
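
A minimal sketch of the vectorizer on a pandas Series of preprocessed names; constructor kwargs are presumed to be forwarded to the underlying TfidfVectorizer and are omitted here:

    import pandas as pd

    from emm.indexing.pandas_normalized_tfidf import PandasNormalizedTfidfVectorizer

    names = pd.Series(["apple inc", "apple incorporated", "banana bv"])

    vectorizer = PandasNormalizedTfidfVectorizer()      # kwargs (if any) assumed to pass through to TfidfVectorizer
    tfidf = vectorizer.fit_transform(names)             # scipy.sparse.csr_matrix of normalized TFIDF vectors
    more = vectorizer.transform(pd.Series(["apple"]))   # reuse the fitted vocabulary and weights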

emm.indexing.pandas_sni module

class emm.indexing.pandas_sni.PandasSortedNeighbourhoodIndexer(input_col='preprocessed', window_length=3, mapping_func=None, indexer_id=None)

Bases: TransformerMixin, SNBaseIndexer

Pandas transformer for sorted neighbourhood indexing

Parameters:
  • input_col (str)

  • window_length (int)

  • mapping_func (Optional[Callable[[str], str]])

  • indexer_id (Optional[int])

calc_score(name1, name2)
Parameters:
  • name1 (Series)

  • name2 (Series)

Return type:

DataFrame

column_prefix()
Return type:

str

fit(X, y=None)

Default Estimator action on fitting with ground truth names.

If a custom mapping function is defined, it is applied.

Args:

X: dataframe with ground truth names.
y: ignored.

Returns:

self

Parameters:
  • X (DataFrame)

  • y (Optional[Series])

Return type:

TransformerMixin

property store_ground_truth: bool
transform(X, multiple_indexers=False)

Default Model action on transforming names to match

Args:

X: dataframe with names to match.
multiple_indexers: ignored.

Returns:

dataframe with candidate SNI name-pairs

Parameters:
  • X (DataFrame)

  • multiple_indexers (bool)

Return type:

DataFrame
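
A usage sketch of the sorted neighbourhood indexer. The interpretation that window_length=3 yields one alphabetical neighbour on each side of a name is an assumption based on standard SNI semantics:

    import pandas as pd

    from emm.indexing.pandas_sni import PandasSortedNeighbourhoodIndexer

    gt = pd.DataFrame({"preprocessed": ["aabco", "abcde", "abcdf", "zeta corp"]})
    names = pd.DataFrame({"preprocessed": ["abcdd"]})

    sni = PandasSortedNeighbourhoodIndexer(input_col="preprocessed", window_length=3)
    sni.fit(gt)                   # stores (and optionally maps) the ground truth names
    pairs = sni.transform(names)  # candidate pairs of alphabetical neighbours with an SNI rank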

emm.indexing.spark_candidate_selection module

emm.indexing.spark_character_tokenizer module

emm.indexing.spark_cos_sim_matcher module

emm.indexing.spark_indexing_utils module

Helper functions for name-matching model save and load

emm.indexing.spark_indexing_utils.as_matrix(vec, dense=False)

Convert a pyspark.ml.linalg.DenseVector to a numpy matrix (only a single row), or a pyspark.ml.linalg.SparseVector to a scipy.sparse.csr matrix (only a single row).

Args:

vec: vector
dense: bool

Returns:

Numpy matrix / scipy csr matrix

Parameters:
  • vec (DenseVector | SparseVector)

  • dense (bool)
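
A hedged sketch of as_matrix on single pyspark vectors (requires pyspark):

    from pyspark.ml.linalg import Vectors

    from emm.indexing.spark_indexing_utils import as_matrix

    dense_vec = Vectors.dense([1.0, 0.0, 2.0])
    sparse_vec = Vectors.sparse(3, [0, 2], [1.0, 2.0])

    m_dense = as_matrix(dense_vec, dense=True)  # 1 x 3 numpy matrix
    m_sparse = as_matrix(sparse_vec)            # 1 x 3 scipy.sparse.csr matrix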

emm.indexing.spark_indexing_utils.collect_matrix(dist_matrix, uid_col, feature_col, blocking_col=None)

Convert a distributed matrix (spark.sql.Column of pyspark.ml.linalg.SparseVector) to a local matrix (scipy.sparse.csr matrix), keeping the ground-truth indices along with the matrix:
  • the returned indices are a 1d np.array containing the ground-truth uids
  • the returned matrix has the same integer positional index as the indices

In the blocking case it returns dicts where the key is the block and the value is the same as described above.

emm.indexing.spark_indexing_utils.curry(func, *args)

Curry a function so that only a single argument remains. This is required for rdd.mapPartitions()
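
An illustrative sketch of curry with rdd.mapPartitions; the partition function and RDD here are hypothetical, not the library's internal usage:

    from emm.indexing.spark_indexing_utils import curry

    def filter_partition(threshold, rows):
        # keep only rows whose score exceeds the threshold
        return (row for row in rows if row["score"] > threshold)

    fn = curry(filter_partition, 0.8)             # bind threshold; only the partition iterator remains free
    # filtered_rdd = some_rdd.mapPartitions(fn)   # hypothetical RDD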

emm.indexing.spark_indexing_utils.dot_product(vec1, vec2)

Dot product of two vectors, for example two pyspark.ml.linalg.SparseVector instances. It works via pyspark.ml*.linalg.*Vector.dot.

Parameters:
  • vec1 (SparseVector | DenseVector)

  • vec2 (SparseVector | DenseVector)

Return type:

float

emm.indexing.spark_indexing_utils.down_casting_int(a)

Automatically downcast an integer array to the smallest int type, according to the minimum and maximum values of the array.

Parameters:

a (array)
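
A small illustration of the expected behaviour; the exact output dtype is an assumption derived from the value range:

    import numpy as np

    from emm.indexing.spark_indexing_utils import down_casting_int

    a = np.array([0, 17, 250], dtype=np.int64)
    small = down_casting_int(a)  # expected to return an 8- or 16-bit integer array (assumed)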

emm.indexing.spark_indexing_utils.explode_candidates(df, with_rank=True, separator='_')

Change the data structure from one row per name-to-match with a list of candidates to one row per candidate.

emm.indexing.spark_indexing_utils.flatten_df(nested_df, nested_cols, separator='_', keep_root_name=True)

Flatten all nested columns that are in nested_cols.

nested_cols: either one struct column or a list of struct columns.

emm.indexing.spark_indexing_utils.stack_features(matrices, dense=False)

Combine multiple (>=1) feature matrices into a larger one.

emm.indexing.spark_indexing_utils.take_topn_per_group(df, n, group, order_by=None, method='exactly', keep_col=True)

Take only the top-n rows per group to remove data skewness.

order_by should be a tuple, e.g. (F.col('C'),). method can take these values:
  • 'exactly' (default): uses row_number()
  • 'at_most': can in some situations remove accounts
  • 'at_least_n_different_order_values': can still lead to some skewness
  • 'at_least': can still lead to some skewness

When to use 'at_least_n_different_order_values' (dense_rank()) over 'exactly' (row_number()):
  • if multiple names have the same count_distinct at the limit, there is no information to pick one over the other (but 'at_most' is better here)
  • if multiple rows are linked together, like an exploded candidates list
  • if an account has more than n different names with the same exact order value
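
A hedged PySpark sketch of taking the top-n rows per group, following the order_by tuple convention documented above; whether group accepts a single column name or a list is an assumption:

    from pyspark.sql import SparkSession, functions as F

    from emm.indexing.spark_indexing_utils import take_topn_per_group

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("acc1", "name a", 10), ("acc1", "name b", 7), ("acc1", "name c", 3), ("acc2", "name d", 1)],
        ["account", "name", "cnt"],
    )

    # Keep exactly the 2 most frequent names per account (row_number semantics).
    top = take_topn_per_group(df, 2, group="account", order_by=(F.col("cnt").desc(),), method="exactly")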

emm.indexing.spark_normalized_tfidf module

emm.indexing.spark_sni module

emm.indexing.spark_word_tokenizer module

Module contents

class emm.indexing.PandasCosSimIndexer(input_col='preprocessed', tokenizer='words', ngram=1, binary_countvectorizer=False, num_candidates=5, cos_sim_lower_bound=0.5, partition_size=5000, max_features=None, n_jobs=1, spark_session=None, blocking_func=None, dtype=<class 'numpy.float32'>, indexer_id=None)

Bases: TransformerMixin, CosSimBaseIndexer

Cosine similarity indexer to generate candidate name-pairs of possible matches

Parameters:
  • input_col (str)

  • tokenizer (Literal['words', 'characters'])

  • ngram (int)

  • binary_countvectorizer (bool)

  • num_candidates (int)

  • cos_sim_lower_bound (float)

  • partition_size (int)

  • max_features (Optional[int])

  • n_jobs (int)

  • spark_session (Optional[Any])

  • blocking_func (Union[Callable[[str], str], str, None])

  • dtype (type[float])

  • indexer_id (Optional[int])

calc_score(name1, name2)
Parameters:
  • name1 (Series)

  • name2 (Series)

Return type:

DataFrame

column_prefix()
Return type:

str

fit(X, y=None)

Fit the cosine similarity indexers to ground truth names

This creates TFIDF weights and matrix based on the ground truth names.

Args:

X: ground truth dataframe with preprocessed names.
y: ignored.

Returns:

self

Parameters:
  • X (DataFrame)

  • y (Optional[Series])

Return type:

TransformerMixin

transform(X, multiple_indexers=None)

Match the names in the X dataset against the previously fitted ground truth.

Args:

X: Pandas dataframe with preprocessed names that should be matched.
multiple_indexers: ignored.

Returns:

Pandas dataframe with the candidate matches returned by the indexer. Each row contains a single pair of candidates. The columns gt_uid and uid contain the index values from the ground truth and X. Optionally, an id column (specified by self.uid_col) and carry-on columns (specified by self.carry_on_cols) are copied from the gt/X dataframes, with the prefix gt_ for ground-truth columns and no prefix for X columns. Any additional columns calculated by the indexer are also preserved (e.g. score).

Parameters:
  • X (DataFrame)

  • multiple_indexers (Optional[bool])

Return type:

DataFrame

class emm.indexing.PandasNaiveIndexer(indexer_id=None)

Bases: TransformerMixin, BaseIndexer

Naive O(n^2) indexer for small datasets. Not for production use.

Parameters:

indexer_id (Optional[int])

fit(X, y=None)

Dummy function, no fitting required.

Parameters:
  • X (DataFrame)

  • y (Optional[Series])

Return type:

TransformerMixin

transform(X, spark_session=None, multiple_indexers=False)

Create all possible name-pairs

Args:

X: dataframe with (processed) input names to match to the ground truth.
spark_session: ignored.
multiple_indexers: ignored.

Returns:

dataframe with all possible candidate name pairs.

Parameters:
  • X (DataFrame)

  • spark_session (Optional[Any])

  • multiple_indexers (bool)

Return type:

DataFrame

class emm.indexing.PandasSortedNeighbourhoodIndexer(input_col='preprocessed', window_length=3, mapping_func=None, indexer_id=None)

Bases: TransformerMixin, SNBaseIndexer

Pandas transformer for sorted neighbourhood indexing

Parameters:
  • input_col (str)

  • window_length (int)

  • mapping_func (Optional[Callable[[str], str]])

  • indexer_id (Optional[int])

calc_score(name1, name2)
Parameters:
  • name1 (Series)

  • name2 (Series)

Return type:

DataFrame

column_prefix()
Return type:

str

fit(X, y=None)

Default Estimator action on fitting with ground truth names.

If a custom mapping function is defined, it is applied.

Args:

X: dataframe with ground truth names.
y: ignored.

Returns:

self

Parameters:
  • X (DataFrame)

  • y (Optional[Series])

Return type:

TransformerMixin

property store_ground_truth: bool
transform(X, multiple_indexers=False)

Default Model action on transforming names to match

Args:

X: dataframe with names to match.
multiple_indexers: ignored.

Returns:

dataframe with candidate SNI name-pairs

Parameters:
  • X (DataFrame)

  • multiple_indexers (bool)

Return type:

DataFrame