emm.preprocessing package¶

Submodules¶

emm.preprocessing.abbreviation_util module¶

emm.preprocessing.abbreviation_util.abbr_match(str_with_abbr, str_with_open_form)¶: Checks whether the second string contains the open form of an abbreviation found in the first string.

emm.preprocessing.abbreviation_util.abbreviations_to_words(name)¶: Maps all the abbreviations to the same format (B. V. = B.V. = B.V = B V = BV)
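
The equivalence B. V. = B.V. = B.V = B V = BV can be illustrated with a short regular-expression sketch. This is not the library's implementation: `merge_spaced_initials` is a hypothetical helper, and it assumes that an abbreviation is a run of two or more single uppercase letters separated by dots and/or single whitespace characters.

```python
import re

# Assumption: an abbreviation is a sequence of single uppercase letters
# joined by "." and/or whitespace, with an optional trailing dot.
_SPACED_INITIALS = re.compile(
    r"(?<![A-Za-z])[A-Z](?:(?:\.\s?|\s)[A-Z])+(?![A-Za-z])\.?"
)


def merge_spaced_initials(name: str) -> str:
    """Collapse dotted or spaced initials into one token (hypothetical helper)."""
    # Keep only the uppercase letters of each matched abbreviation.
    return _SPACED_INITIALS.sub(
        lambda m: re.sub(r"[^A-Z]", "", m.group(0)), name
    )


variants = ["B. V.", "B.V.", "B.V", "B V", "BV"]
normalized = [merge_spaced_initials("ING Bank " + v) for v in variants]
# Every variant becomes "ING Bank BV".
```

A pattern this simple also merges unrelated single capitals (for example "J P Morgan" becomes "JP Morgan"), which may or may not be desirable for a given dataset.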

emm.preprocessing.abbreviation_util.extract_abbr_merged_initials(abbr, name)¶: Extracts the possible open form of the given abbreviation, if it exists. Example: (SK, Fenerbahce Spor Klubu) => Spor Klubu
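
One way to picture the merged-initials lookup is a sliding window over the words of the name, comparing word initials against the letters of the abbreviation. The sketch below is an illustration under that assumption, not the library's code; `open_form_from_initials` is a hypothetical name, and returning `None` when nothing matches is also an assumption.

```python
from typing import Optional


def open_form_from_initials(abbr: str, name: str) -> Optional[str]:
    """Return consecutive words of `name` whose initials spell `abbr`."""
    letters = abbr.upper()
    words = name.split()
    # Slide a window of len(letters) words across the name.
    for start in range(len(words) - len(letters) + 1):
        window = words[start:start + len(letters)]
        initials = "".join(w[0].upper() for w in window)
        if initials == letters:
            return " ".join(window)
    return None  # assumption: no match is signalled with None


result = open_form_from_initials("SK", "Fenerbahce Spor Klubu")
# result == "Spor Klubu"
```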

emm.preprocessing.abbreviation_util.extract_abbr_merged_word_pieces(abbr, name)¶: Extracts the possible open form of the given abbreviation, if it exists. Example: (PetroBras, Petroleo Brasileiro B.V.) => Petroleo Brasileiro
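
The word-piece case can be sketched in a similar way: split the abbreviation at its capital letters, then look for consecutive words that start with those pieces. The helper names below are hypothetical, and the matching rule (case-insensitive prefix match on consecutive words) is an assumption made for illustration rather than the library's documented behaviour.

```python
import re
from typing import List, Optional


def split_word_pieces(abbr: str) -> List[str]:
    """Split a merged abbreviation at capital letters: PetroBras -> Petro, Bras."""
    return re.findall(r"[A-Z][a-z]+", abbr)


def open_form_from_word_pieces(abbr: str, name: str) -> Optional[str]:
    """Return consecutive words of `name` that start with the pieces of `abbr`."""
    pieces = [p.lower() for p in split_word_pieces(abbr)]
    if len(pieces) < 2:
        return None  # not a merged word-piece abbreviation
    words = name.split()
    for start in range(len(words) - len(pieces) + 1):
        window = words[start:start + len(pieces)]
        if all(w.lower().startswith(p) for w, p in zip(window, pieces)):
            return " ".join(window)
    return None


result = open_form_from_word_pieces("PetroBras", "Petroleo Brasileiro B.V.")
# result == "Petroleo Brasileiro"
```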

emm.preprocessing.abbreviation_util.find_abbr_merged_initials(name)¶: Finds abbreviations with merged initials. Examples: FC Barcelona => FC; ING BANK B.V. => BV

emm.preprocessing.abbreviation_util.find_abbr_merged_word_pieces(name)¶: Finds abbreviations with merged word pieces. Example: PetroBras

emm.preprocessing.abbreviation_util.legal_abbreviations_to_words(name)¶: Maps all the abbreviations to the same format (B. V. = B.V. = B V = BV)

emm.preprocessing.abbreviation_util.preprocess(name)¶

emm.preprocessing.base_name_preprocessor module¶

This module provides several helper functions for name preprocessing.

As a user, you can use preprocess_name directly.

class emm.preprocessing.base_name_preprocessor.AbstractPreprocessor(preprocess_pipeline='preprocess_merge_abbr', input_col='name', output_col='preprocessed', spark_session=None)¶

Bases: Module

Base class of Name Preprocessor

Parameters:

preprocess_pipeline (Any)
input_col (str)
output_col (str)
spark_session (Optional[Any])

create_func_dict()¶

Return type: dict[str, Any]

emm.preprocessing.functions module¶

emm.preprocessing.functions.create_func_dict(use_spark=True)¶

Parameters: use_spark (bool)
Return type: dict[str, Union[Callable[[Any], Any], Callable[[str], str]]]

emm.preprocessing.functions.replace_none(name)¶

Parameters: name (str | None)
Return type: str

emm.preprocessing.pandas_functions module¶

emm.preprocessing.pandas_functions.lower(x)¶

Parameters: x (Series)
Return type: Series

emm.preprocessing.pandas_functions.regex_replace(pat, repl, simple=False)¶

Parameters:

pat (str)
repl (str)
simple (bool)

Return type:

Callable[[Series], Series]
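
The signature above describes a factory: it takes a pattern and a replacement and returns a function that maps a Series to a Series. A minimal sketch of that pattern, assuming it is built on pandas' `Series.str.replace`, is shown below. `make_regex_replacer` is a hypothetical stand-in, and the `simple` flag is left out because its behaviour is not described here.

```python
from typing import Callable

import pandas as pd

SeriesFn = Callable[[pd.Series], pd.Series]


def make_regex_replacer(pat: str, repl: str) -> SeriesFn:
    """Build a Series -> Series function that applies one regex substitution."""

    def func(x: pd.Series) -> pd.Series:
        # regex=True makes pandas treat `pat` as a regular expression.
        return x.str.replace(pat, repl, regex=True)

    return func


# Remove every character that is neither a word character nor whitespace.
strip_punct = make_regex_replacer(r"[^\w\s]", "")
names = pd.Series(["ING Bank, B.V.", "Fenerbahce S.K."])
cleaned = strip_punct(names)
```

Returning a closure rather than applying the substitution directly lets each step be configured once and then stored alongside other Series functions in a pipeline.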

emm.preprocessing.pandas_functions.run_custom_function(fn)¶

Return type: Callable[[Series], Series]

emm.preprocessing.pandas_functions.trim(x)¶

Parameters: x (Series)
Return type: Series

emm.preprocessing.pandas_functions.trim_lower(x)¶

Parameters: x (Series)
Return type: Series

emm.preprocessing.pandas_preprocessor module¶

class emm.preprocessing.pandas_preprocessor.PandasPreprocessor(preprocess_pipeline='preprocess_merge_abbr', input_col='name', output_col='preprocessed', spark_session=None)¶

Bases: TransformerMixin, AbstractPreprocessor

Pandas implementation of Name Preprocessor

Parameters:

preprocess_pipeline (Any)
input_col (str)
output_col (str)
spark_session (Optional[Any])

create_func_dict()¶

Return type: Mapping[str, Callable]

fit(*args, **kwargs)¶

Dummy function; this class does not require fitting.

Args: args: ignored. kwargs: ignored.
Returns: self

Parameters:

args (Any)
kwargs (Any)

Return type:

TransformerMixin

fit_transform(X, y=None, **extra_params)¶

Apply the preprocessing transform() to the input names.

Performs string cleaning: converts to lower case, removes punctuation and white space, and converts legal entity forms to standard abbreviations.

Note that this class does not require fitting, so no fitting is performed.

Args: X: dataframe containing input names. y: ignored. extra_params: extra parameters passed on to the transform() function.
Returns: dataframe with preprocessed names

Parameters:

X (DataFrame)
y (Optional[Series])
extra_params (Any)

Return type:

DataFrame

transform(dataset, y=None)¶

Apply preprocessing functions to input names in dataframe

Performs string cleaning: converts to lower case, removes punctuation and white space, and converts legal entity forms to standard abbreviations.

Args: dataset: dataframe containing input names. y: ignored.
Returns: dataframe with preprocessed names

Parameters: dataset (DataFrame)
Return type: DataFrame
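
The fit / fit_transform / transform trio above follows the usual stateless-transformer pattern: fit does nothing and returns self, and fit_transform simply delegates to transform. The sketch below mimics that interface with a hypothetical `SketchPreprocessor`; it is not the PandasPreprocessor source. The three cleaning steps, and the choice to return a copy rather than modify the input frame, are assumptions made for illustration.

```python
import pandas as pd


class SketchPreprocessor:
    """Hypothetical stand-in that mirrors the documented method signatures."""

    def __init__(self, steps, input_col="name", output_col="preprocessed"):
        self.steps = steps  # list of Series -> Series callables
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, *args, **kwargs):
        # Nothing to learn: arguments are ignored and self is returned.
        return self

    def transform(self, dataset, y=None):
        out = dataset.copy()  # assumption: leave the caller's frame untouched
        col = out[self.input_col]
        for step in self.steps:
            col = step(col)
        out[self.output_col] = col
        return out

    def fit_transform(self, X, y=None, **extra_params):
        # No fitting is needed, so this only forwards to transform().
        return self.transform(X, **extra_params)


steps = [
    lambda s: s.str.lower(),
    lambda s: s.str.replace(r"[^\w\s]", "", regex=True),
    lambda s: s.str.strip(),
]
df = pd.DataFrame({"name": ["  ING Bank, B.V. "]})
out = SketchPreprocessor(steps).fit_transform(df)
```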

emm.preprocessing.spark_functions module¶

emm.preprocessing.spark_preprocessor module¶

Module contents¶

class emm.preprocessing.PandasPreprocessor(preprocess_pipeline='preprocess_merge_abbr', input_col='name', output_col='preprocessed', spark_session=None)¶

Bases: TransformerMixin, AbstractPreprocessor

Pandas implementation of Name Preprocessor

Parameters:

preprocess_pipeline (Any)
input_col (str)
output_col (str)
spark_session (Optional[Any])

create_func_dict()¶

Return type: Mapping[str, Callable]

fit(*args, **kwargs)¶

Dummy function; this class does not require fitting.

Args: args: ignored. kwargs: ignored.
Returns: self

Parameters:

args (Any)
kwargs (Any)

Return type:

TransformerMixin

fit_transform(X, y=None, **extra_params)¶

Apply the preprocessing transform() to the input names.

Performs string cleaning: converts to lower case, removes punctuation and white space, and converts legal entity forms to standard abbreviations.

Note that this class does not require fitting, so no fitting is performed.

Args: X: dataframe containing input names. y: ignored. extra_params: extra parameters passed on to the transform() function.
Returns: dataframe with preprocessed names

Parameters:

X (DataFrame)
y (Optional[Series])
extra_params (Any)

Return type:

DataFrame

transform(dataset, y=None)¶

Apply preprocessing functions to input names in dataframe

Performs string cleaning: converts to lower case, removes punctuation and white space, and converts legal entity forms to standard abbreviations.

Args: dataset: dataframe containing input names. y: ignored.
Returns: dataframe with preprocessed names

Parameters: dataset (DataFrame)
Return type: DataFrame