emm.indexing package¶
Submodules¶
emm.indexing.base_indexer module¶
- class emm.indexing.base_indexer.BaseIndexer¶
Bases: Module
Base implementation of Indexer class
- decrease_window_by_one_step()¶
Utility function for negative sample creation during training
This should change the parameter settings of the fitted model.
- increase_window_by_one_step()¶
Utility function for negative sample creation during training
This should change the parameter settings of the fitted model.
- static version()¶
- class emm.indexing.base_indexer.CosSimBaseIndexer(num_candidates)¶
Bases: BaseIndexer
Base implementation of CosSimIndexer class
- Parameters:
num_candidates (int)
- decrease_window_by_one_step()¶
Utility function for negative sample creation during training
This changes the parameter settings of the fitted model.
- Return type:
None
- increase_window_by_one_step()¶
Utility function for negative sample creation during training
This changes the parameter settings of the fitted model.
- Return type:
None
- class emm.indexing.base_indexer.SNBaseIndexer(window_length)¶
Bases: BaseIndexer
Base implementation of SN Indexer class
- Parameters:
window_length (int)
- decrease_window_by_one_step()¶
Utility function for negative sample creation during training
This changes the parameter settings of the fitted model.
- Return type:
None
- increase_window_by_one_step()¶
Utility function for negative sample creation during training
This changes the parameter settings of the fitted model.
- Return type:
None
emm.indexing.pandas_candidate_selection module¶
- class emm.indexing.pandas_candidate_selection.PandasCandidateSelectionTransformer(indexers, uid_col=None, carry_on_cols=None, with_no_matches=True)¶
Bases: TransformerMixin
Pandas middleware class that aggregates candidate pairs for possible matches.
- Parameters:
indexers (list[BaseIndexer])
uid_col (Optional[str])
carry_on_cols (Optional[list[str]])
with_no_matches (bool | None)
- decrease_window_by_one_step()¶
Utility function for negative sample creation during training
- Return type:
None
- fit(X, y=None)¶
Fit the indexers to ground truth names
For example, this creates TFIDF matrices for the cosine similarity indexers.
- Args:
X: ground truth dataframe with preprocessed names.
y: ignored.
- Returns:
self
- Parameters:
X (DataFrame)
y (Optional[Series])
- Return type:
TransformerMixin
- fit_transform(X, y=None)¶
Tailored placeholder for fit_transform
Only calls fit(gt). This avoids an unnecessary transform of gt during SklearnPipeline.fit_transform(gt).
The sklearn Pipeline calls fit_transform on every stage except the last one, and with a supervised model the CandidateSelection stage is an intermediate step.
- Args:
X: Pandas dataframe with names that are used to fit the indexers.
y: ignored.
- Returns:
Pandas dataframe with the processed ground truth names.
- Parameters:
X (DataFrame)
y (Optional[Series])
- Return type:
DataFrame
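The reasoning behind this fit_transform can be illustrated with a small stand-in. This is a minimal sketch, not the library's code: `FitOnlyStage`, `LastStage` and `run_pipeline_fit` are hypothetical names, and `run_pipeline_fit` only mimics the call pattern described above (fit_transform on every stage except the last).

```python
class FitOnlyStage:
    """Hypothetical intermediate stage whose fit_transform only fits."""

    def __init__(self):
        self.fit_calls = 0
        self.transform_calls = 0

    def fit(self, X, y=None):
        self.fit_calls += 1
        return self

    def transform(self, X):
        self.transform_calls += 1
        return X

    def fit_transform(self, X, y=None):
        # Fit only, then hand the ground truth on unchanged:
        # transforming the ground truth against itself is skipped.
        self.fit(X, y)
        return X


class LastStage:
    """Hypothetical final stage that records what it was fitted on."""

    def __init__(self):
        self.seen = None

    def fit(self, X, y=None):
        self.seen = X
        return self


def run_pipeline_fit(stages, X, y=None):
    """Mimic a pipeline fit: fit_transform on all stages but the last."""
    for stage in stages[:-1]:
        X = stage.fit_transform(X, y)
    stages[-1].fit(X, y)
    return stages


stage, last = FitOnlyStage(), LastStage()
run_pipeline_fit([stage, last], ["acme bv", "acme bank"])
print(stage.fit_calls, stage.transform_calls)
```

In this stand-in the intermediate stage is fitted once and its transform is never called during fitting.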
- increase_window_by_one_step()¶
Utility function for negative sample creation during training
- Return type:
None
- transform(X)¶
Match the X dataset to the previously fitted ground truth.
- Args:
X: Pandas dataframe with preprocessed names that should be matched
- Returns:
Pandas dataframe with the candidate matches returned by the indexers. Each row contains a single pair of candidates.
The columns gt_uid and uid contain the index values from the ground truth and X. Optionally, the id column (specified by self.uid_col) and carry-on columns (specified by self.carry_on_cols) are copied from the gt/X dataframes, with the prefix gt_ for ground-truth columns and no prefix for X columns. Any additional columns calculated by indexers are also preserved (e.g. score).
- Parameters:
X (DataFrame)
- Return type:
DataFrame
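As a rough illustration of what aggregating candidate pairs from several indexers can look like, the sketch below merges per-indexer results into one row per (gt_uid, uid) pair. It is written in plain Python rather than pandas, and the function name and the `score_0`, `score_1` column naming are assumptions made for the example, not the library's conventions.

```python
def aggregate_candidates(per_indexer_pairs):
    """Merge candidate pairs from several indexers into one row per pair.

    per_indexer_pairs: one list per indexer, each holding dicts with the
    keys 'gt_uid', 'uid' and 'score'. Each indexer's score ends up in its
    own column (score_0, score_1, ...; a hypothetical naming scheme).
    """
    merged = {}
    for i, pairs in enumerate(per_indexer_pairs):
        for pair in pairs:
            key = (pair["gt_uid"], pair["uid"])
            row = merged.setdefault(key, {"gt_uid": key[0], "uid": key[1]})
            row[f"score_{i}"] = pair["score"]
    # Sort by (gt_uid, uid) for a deterministic row order.
    return [merged[key] for key in sorted(merged)]


cos_sim_pairs = [
    {"gt_uid": 1, "uid": 10, "score": 0.9},
    {"gt_uid": 2, "uid": 10, "score": 0.6},
]
sni_pairs = [{"gt_uid": 1, "uid": 10, "score": 1.0}]
rows = aggregate_candidates([cos_sim_pairs, sni_pairs])
```

A pair found by both indexers appears once and carries both scores; a pair found by only one indexer carries only that indexer's score.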
- emm.indexing.pandas_candidate_selection.select_with_prefix(df, cols, prefix)¶
- Parameters:
df (DataFrame)
cols (list[str])
prefix (str)
- Return type:
DataFrame
emm.indexing.pandas_cos_sim_matcher module¶
- class emm.indexing.pandas_cos_sim_matcher.PandasCosSimIndexer(input_col='preprocessed', tokenizer='words', ngram=1, binary_countvectorizer=False, num_candidates=5, cos_sim_lower_bound=0.5, partition_size=5000, max_features=None, n_jobs=1, spark_session=None, blocking_func=None, dtype=<class 'numpy.float32'>, indexer_id=None)¶
Bases: TransformerMixin, CosSimBaseIndexer
Cosine similarity indexer to generate candidate name-pairs of possible matches
- Parameters:
input_col (str)
tokenizer (Literal['words', 'characters'])
ngram (int)
binary_countvectorizer (bool)
num_candidates (int)
cos_sim_lower_bound (float)
partition_size (int)
max_features (Optional[int])
n_jobs (int)
spark_session (Optional[Any])
blocking_func (Union[Callable[[str], str], str, None])
dtype (type[float])
indexer_id (Optional[int])
- calc_score(name1, name2)¶
- Parameters:
name1 (Series)
name2 (Series)
- Return type:
DataFrame
- column_prefix()¶
- Return type:
str
- fit(X, y=None)¶
Fit the cosine similarity indexers to ground truth names
This creates TFIDF weights and matrix based on the ground truth names.
- Args:
X: ground truth dataframe with preprocessed names.
y: ignored.
- Returns:
self
- Parameters:
X (DataFrame)
y (Optional[Series])
- Return type:
TransformerMixin
- transform(X, multiple_indexers=None)¶
Match the X dataset to the previously fitted ground truth.
- Args:
X: Pandas dataframe with preprocessed names that should be matched
multiple_indexers: ignored
- Returns:
Pandas dataframe with the candidate matches returned by the indexer. Each row contains a single pair of candidates. The columns gt_uid and uid contain the index values from the ground truth and X. Optionally, the id column (specified by self.uid_col) and carry-on columns (specified by self.carry_on_cols) are copied from the gt/X dataframes, with the prefix gt_ for ground-truth columns and no prefix for X columns. Any additional columns calculated by indexers are also preserved (e.g. score).
- Parameters:
X (DataFrame)
multiple_indexers (Optional[bool])
- Return type:
DataFrame
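The roles of num_candidates and cos_sim_lower_bound can be sketched in plain Python. This is an illustration of the general technique only: it uses raw word counts instead of TFIDF weights, compares against every ground truth name, and treats the lower bound as inclusive, all of which are assumptions. The function names are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_candidates(gt_names, name, num_candidates=5, cos_sim_lower_bound=0.5):
    """Return up to num_candidates (score, gt_name) pairs above the bound."""
    query = Counter(name.split())
    scored = [(cosine_similarity(Counter(gt.split()), query), gt) for gt in gt_names]
    # Drop weak pairs first, then keep the best-scoring ones.
    kept = [pair for pair in scored if pair[0] >= cos_sim_lower_bound]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:num_candidates]


gt_names = ["acme holding bv", "acme bank", "other company"]
candidates = select_candidates(gt_names, "acme holding", cos_sim_lower_bound=0.6)
```

With a bound of 0.6 only "acme holding bv" survives for the query "acme holding"; lowering the bound admits weaker pairs, and num_candidates then caps how many are kept per name.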
emm.indexing.pandas_naive_indexer module¶
- emm.indexing.pandas_naive_indexer.NaiveIndexer¶
alias of
PandasNaiveIndexer
- class emm.indexing.pandas_naive_indexer.PandasNaiveIndexer(indexer_id=None)¶
Bases: TransformerMixin, BaseIndexer
Naive O(n^2) indexer for small datasets. Not for production use.
- Parameters:
indexer_id (Optional[int])
- fit(X, y=None)¶
Dummy function, no fitting required.
- Parameters:
X (DataFrame)
y (Optional[Series])
- Return type:
TransformerMixin
- transform(X, spark_session=None, multiple_indexers=False)¶
Create all possible name-pairs
- Args:
X: dataframe with (processed) input names to match to the ground truth.
spark_session: ignored
multiple_indexers: ignored
- Returns:
dataframe with all possible candidate name pairs.
- Parameters:
X (DataFrame)
spark_session (Optional[Any])
multiple_indexers (bool)
- Return type:
DataFrame
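Creating all possible name-pairs amounts to a Cartesian product, which is why this indexer is only suited to small datasets. A minimal sketch in plain Python follows; the function name and the dict-based row layout are assumptions for the example.

```python
from itertools import product


def all_pairs(gt_uids, uids):
    """Pair every ground truth uid with every uid to match."""
    return [{"gt_uid": gt_uid, "uid": uid} for gt_uid, uid in product(gt_uids, uids)]


pairs = all_pairs([0, 1, 2], [100, 101])
print(len(pairs))  # number of pairs = len(gt_uids) * len(uids)
```

The number of pairs grows with the product of the two input sizes, so 10,000 ground truth names and 10,000 names to match already give 100 million candidate pairs.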
emm.indexing.pandas_normalized_tfidf module¶
Customized TFIDF vectorization.
- class emm.indexing.pandas_normalized_tfidf.PandasNormalizedTfidfVectorizer(**kwargs)¶
Bases: TfidfVectorizer
Implementation of customized TFIDF vectorizer
- Parameters:
kwargs (Any)
- fit(X)¶
Fit the TFIDF vectorizer.
- Args:
X: dataframe with preprocessed names
- Returns:
self
- Parameters:
X (Series | DataFrame)
- Return type:
TfidfVectorizer
- fit_transform(raw_documents, y=None)¶
Implementation of fit followed by transform
- Args:
raw_documents: dataframe with preprocessed input names.
y: ignored.
- Returns:
normalized tfidf vectors of names.
- Parameters:
raw_documents (Series | DataFrame)
y (Optional[Any])
- Return type:
csr_matrix
- transform(X)¶
Apply the fitted TFIDF vectorizer
- Args:
X: dataframe with preprocessed names
- Returns:
normalized tfidf vectors of names
- Parameters:
X (Series | DataFrame)
- Return type:
csr_matrix
- transform_parallel(X, n_jobs=-1)¶
Apply the fitted TFIDF vectorizer in parallel
Inspired by: https://github.com/scikit-learn/scikit-learn/issues/7635#issuecomment-254407618
- Args:
X: dataframe with preprocessed names
n_jobs: desired number of parallel jobs. Default is all available cores.
- Returns:
normalized tfidf vectors of names
- Parameters:
X (Series | DataFrame)
n_jobs (int)
- Return type:
csr_matrix
emm.indexing.pandas_sni module¶
- class emm.indexing.pandas_sni.PandasSortedNeighbourhoodIndexer(input_col='preprocessed', window_length=3, mapping_func=None, indexer_id=None)¶
Bases: TransformerMixin, SNBaseIndexer
Pandas transformer for sorted neighbourhood indexing
- Parameters:
input_col (str)
window_length (int)
mapping_func (Optional[Callable[[str], str]])
indexer_id (Optional[int])
- calc_score(name1, name2)¶
- Parameters:
name1 (Series)
name2 (Series)
- Return type:
DataFrame
- column_prefix()¶
- Return type:
str
- fit(X, y=None)¶
Default Estimator action on fitting with ground truth names.
If a custom mapping function is defined, it is applied.
- Args:
X: data frame with ground truth names
y: ignored
- Returns:
self
- Parameters:
X (DataFrame)
y (Optional[Series])
- Return type:
TransformerMixin
- property store_ground_truth: bool¶
- transform(X, multiple_indexers=False)¶
Default Model action on transforming names to match
- Args:
X: dataframe with names to match.
multiple_indexers: ignored.
- Returns:
dataframe with candidate SNI name-pairs
- Parameters:
X (DataFrame)
multiple_indexers (bool)
- Return type:
DataFrame
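Sorted neighbourhood indexing in general pairs each name with the ground truth names that sit close to it in sort order. The sketch below shows that idea in plain Python, assuming an odd window_length centred on the position where the name would be inserted among the sorted ground truth names. The function name and the exact window semantics are assumptions, not the library's implementation.

```python
from bisect import bisect_left


def sorted_neighbourhood_candidates(gt_names, names_to_match, window_length=3):
    """Pair each name with the ground truth names near it in sort order."""
    half = (window_length - 1) // 2
    sorted_gt = sorted(gt_names)
    pairs = []
    for name in names_to_match:
        # Position where `name` would be inserted to keep sorted_gt sorted.
        pos = bisect_left(sorted_gt, name)
        lo = max(0, pos - half)
        hi = min(len(sorted_gt), pos + half + 1)
        pairs.extend((gt, name) for gt in sorted_gt[lo:hi])
    return pairs


pairs = sorted_neighbourhood_candidates(
    ["gamma", "alpha", "delta", "beta"], ["charlie"], window_length=3
)
```

Here "charlie" sorts between "beta" and "delta", so the window picks up its nearest neighbours and leaves out "alpha". A larger window_length yields more candidate pairs per name, which is the kind of parameter change the window-step utility functions refer to.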
emm.indexing.spark_candidate_selection module¶
emm.indexing.spark_character_tokenizer module¶
emm.indexing.spark_cos_sim_matcher module¶
emm.indexing.spark_indexing_utils module¶
Helper function for name matching model save and load
- emm.indexing.spark_indexing_utils.as_matrix(vec, dense=False)¶
Convert a pyspark.ml.linalg.DenseVector to a numpy matrix (only a single row).
Convert a pyspark.ml.linalg.SparseVector to a scipy.sparse.csr matrix (only a single row).
- Args:
vec: vector
dense: bool
- Returns:
Numpy matrix / scipy csr matrix
- Parameters:
vec (DenseVector | SparseVector)
dense (bool)
- emm.indexing.spark_indexing_utils.collect_matrix(dist_matrix, uid_col, feature_col, blocking_col=None)¶
Convert a distributed matrix (spark.sql.Column of pyspark.ml.linalg.SparseVector) to a local matrix (scipy.sparse.csr matrix), and keep the ground truth indices along with the matrix:
- the returned indices are a 1d np.array containing the ground-truth uids
- the returned matrix has the same integer position index as the indices
In the blocking case it returns dicts where the key is the block and the value is the same as described above.
- emm.indexing.spark_indexing_utils.curry(func, *args)¶
Curry a function so that only a single argument remains. This is required for rdd.mapPartitions()
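A minimal sketch of this kind of currying, assuming the remaining free argument is the first positional one (the partition iterator in the rdd.mapPartitions() case). `curry_sketch` and `scale_partition` are hypothetical names used only for illustration.

```python
def curry_sketch(func, *args):
    """Bind all trailing arguments so that a one-argument callable remains."""
    # Assumption: the free argument comes first, the bound ones follow.
    return lambda x: func(x, *args)


def scale_partition(rows, factor, offset):
    """Example worker that needs two extra settings besides the rows."""
    return [row * factor + offset for row in rows]


per_partition = curry_sketch(scale_partition, 10, 1)
print(per_partition([1, 2, 3]))
```

The resulting `per_partition` takes only the rows, which is the shape a per-partition callback needs.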
- emm.indexing.spark_indexing_utils.dot_product(vec1, vec2)¶
Dot product of two vectors, for example two pyspark.ml.linalg.SparseVector objects. It works for any type matching pyspark.ml*.linalg.*Vector, using the vector's dot method.
- Parameters:
vec1 (SparseVector | DenseVector)
vec2 (SparseVector | DenseVector)
- Return type:
float
- emm.indexing.spark_indexing_utils.down_casting_int(a)¶
Automatically downcast integers to the smallest int type according to the minimum and maximum value of the array
- Parameters:
a (array)
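The selection rule behind such a downcast can be sketched without numpy: pick the narrowest integer width whose range covers both the minimum and the maximum. The sketch below considers signed widths only, which is an assumption, and the function name is hypothetical.

```python
def smallest_signed_int_bits(values):
    """Return the narrowest signed width (8, 16, 32 or 64 bits) that fits."""
    lo, hi = min(values), max(values)
    for bits in (8, 16, 32, 64):
        # A signed integer of this width spans [-2**(bits-1), 2**(bits-1) - 1].
        if -(2 ** (bits - 1)) <= lo and hi <= 2 ** (bits - 1) - 1:
            return bits
    raise OverflowError("values do not fit in a 64-bit signed integer")


print(smallest_signed_int_bits([0, 100]))
print(smallest_signed_int_bits([0, 40000]))
```

Values between -128 and 127 fit in 8 bits, while a maximum of 40000 exceeds the 16-bit limit of 32767 and therefore needs 32 bits.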
- emm.indexing.spark_indexing_utils.explode_candidates(df, with_rank=True, separator='_')¶
Change the data structure from one row per names_to_match with a list of candidates to one row per candidate
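The reshaping described here can be sketched on plain Python dicts instead of a Spark dataframe. The field names (`uid`, `candidates`, `gt_uid`, `rank`) and the choice to start ranks at 1 are assumptions made for the example.

```python
def explode_rows(rows, with_rank=True):
    """Turn one row per name (with a candidate list) into one row per candidate."""
    exploded = []
    for row in rows:
        for rank, gt_uid in enumerate(row["candidates"], start=1):
            flat = {"uid": row["uid"], "gt_uid": gt_uid}
            if with_rank:
                # Rank follows the order of the candidate list.
                flat["rank"] = rank
            exploded.append(flat)
    return exploded


rows = [{"uid": 10, "candidates": [3, 7]}, {"uid": 11, "candidates": [5]}]
flat_rows = explode_rows(rows)
```

Two input rows holding three candidates in total become three output rows, each carrying the uid it came from.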
- emm.indexing.spark_indexing_utils.flatten_df(nested_df, nested_cols, separator='_', keep_root_name=True)¶
Flatten all nested columns that are in nested_cols.
nested_cols: either one struct column or list of struct columns
- emm.indexing.spark_indexing_utils.stack_features(matrices, dense=False)¶
Combine multiple (>=1) feature matrices to a larger one
- emm.indexing.spark_indexing_utils.take_topn_per_group(df, n, group, order_by=None, method='exactly', keep_col=True)¶
Take only the top-n rows per group to remove data skewness. order_by should be a tuple like: (F.col('C'), )
Method can have these values:
- 'at_most' can in some situations remove accounts
- 'at_least_n_different_order_values' can still lead to some skewness
- 'at_least' can still lead to some skewness
When to use 'at_least_n_different_order_values' (dense_rank()) over 'exactly' (row_number()):
- if we have multiple names with the same count_distinct at the limit, we have no information to pick one over the other (but 'at_most' is better here)
- if we have multiple rows that are linked together, like an exploded candidates list
- if you have within an account more than n different names with the exact same order value
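The difference between the row_number() and dense_rank() variants can be sketched in plain Python. The example covers only the 'exactly' and 'at_least_n_different_order_values' methods, assumes a descending sort on a single order value, and uses a hypothetical function name; it is not the Spark implementation.

```python
from itertools import groupby


def top_n_per_group(rows, n, method="exactly"):
    """rows: list of (group, order_value) tuples; higher order values rank first."""
    ordered = sorted(rows, key=lambda row: (row[0], -row[1]))
    kept = []
    for _, members in groupby(ordered, key=lambda row: row[0]):
        members = list(members)
        if method == "exactly":
            # row_number() <= n: exactly the first n rows, ties cut arbitrarily.
            kept.extend(members[:n])
        elif method == "at_least_n_different_order_values":
            # dense_rank() <= n: every row sharing one of the top n distinct values.
            top_values = sorted({value for _, value in members}, reverse=True)[:n]
            kept.extend(row for row in members if row[1] in top_values)
        else:
            raise ValueError(f"method not covered by this sketch: {method}")
    return kept


rows = [("a", 5), ("a", 5), ("a", 3), ("a", 1), ("b", 2)]
exact = top_n_per_group(rows, 2, method="exactly")
dense = top_n_per_group(rows, 2, method="at_least_n_different_order_values")
```

With n=2, the 'exactly' variant keeps the two tied rows of group "a" and nothing else from that group, while the dense-rank variant also keeps the row with the next distinct value. This is why the latter can return more than n rows per group.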
emm.indexing.spark_normalized_tfidf module¶
emm.indexing.spark_sni module¶
emm.indexing.spark_word_tokenizer module¶
Module contents¶
- class emm.indexing.PandasCosSimIndexer(input_col='preprocessed', tokenizer='words', ngram=1, binary_countvectorizer=False, num_candidates=5, cos_sim_lower_bound=0.5, partition_size=5000, max_features=None, n_jobs=1, spark_session=None, blocking_func=None, dtype=<class 'numpy.float32'>, indexer_id=None)¶
Bases: TransformerMixin, CosSimBaseIndexer
Cosine similarity indexer to generate candidate name-pairs of possible matches
Same class as emm.indexing.pandas_cos_sim_matcher.PandasCosSimIndexer; its parameters and methods are documented there.
- class emm.indexing.PandasNaiveIndexer(indexer_id=None)¶
Bases: TransformerMixin, BaseIndexer
Naive O(n^2) indexer for small datasets. Not for production use.
Same class as emm.indexing.pandas_naive_indexer.PandasNaiveIndexer; its parameters and methods are documented there.
- class emm.indexing.PandasSortedNeighbourhoodIndexer(input_col='preprocessed', window_length=3, mapping_func=None, indexer_id=None)¶
Bases: TransformerMixin, SNBaseIndexer
Pandas transformer for sorted neighbourhood indexing
Same class as emm.indexing.pandas_sni.PandasSortedNeighbourhoodIndexer; its parameters and methods are documented there.