emm.preprocessing package
Submodules
emm.preprocessing.abbreviation_util module
- emm.preprocessing.abbreviation_util.abbr_match(str_with_abbr, str_with_open_form)
Checks whether the second string contains the open form of an abbreviation found in the first string.
- emm.preprocessing.abbreviation_util.abbreviations_to_words(name)
Maps all variants of an abbreviation to the same format (B. V. = B.V. = B.V = B V = BV).
- emm.preprocessing.abbreviation_util.extract_abbr_merged_initials(abbr, name)
Extracts the possible open form of the given abbreviation, if it exists. Example: (SK, Fenerbahce Spor Klubu) => Spor Klubu
- emm.preprocessing.abbreviation_util.extract_abbr_merged_word_pieces(abbr, name)
Extracts the possible open form of the given abbreviation, if it exists. Example: (PetroBras, Petroleo Brasileiro B.V.) => Petroleo Brasileiro
- emm.preprocessing.abbreviation_util.find_abbr_merged_initials(name)
Finds abbreviations with merged initials. Examples: FC Barcelona => FC, ING BANK B.V. => BV
- emm.preprocessing.abbreviation_util.find_abbr_merged_word_pieces(name)
Finds abbreviations with merged word pieces. Example: PetroBras
- emm.preprocessing.abbreviation_util.legal_abbreviations_to_words(name)
Maps all variants of an abbreviation to the same format (B. V. = B.V. = B V = BV).
- emm.preprocessing.abbreviation_util.preprocess(name)
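The equivalence B. V. = B.V. = B.V = B V = BV described above can be illustrated with a small regular expression. This is a minimal sketch only: `merge_initials` is a hypothetical helper written for this example and is not the implementation used in `abbreviation_util`. It assumes an abbreviation is a run of two or more standalone letters separated by dots and/or spaces.

```python
import re

# Hypothetical pattern (not from emm): a standalone letter, followed by one or
# more groups of "dots/spaces + standalone letter", plus an optional final dot.
_INITIALS = re.compile(r"(?<![A-Za-z])[A-Za-z](?:[. ]+[A-Za-z])+(?![A-Za-z])\.?")


def merge_initials(name: str) -> str:
    """Collapse dotted or spaced initials into one upper-case token."""
    def _collapse(match: re.Match) -> str:
        # Keep only the letters of the matched run, e.g. "B. V." -> "BV".
        return re.sub(r"[^A-Za-z]", "", match.group(0)).upper()

    return _INITIALS.sub(_collapse, name)


for variant in ["B. V.", "B.V.", "B.V", "B V", "BV"]:
    print(repr(variant), "->", merge_initials(variant))
```

Each variant in the loop maps to `BV`, and a full name such as `ING BANK B.V.` becomes `ING BANK BV`. Ordinary words are left alone because every matched letter must stand on its own.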
emm.preprocessing.base_name_preprocessor module
This file provides several helper functions for name preprocessing.
As a user, you can use preprocess_name directly.
- class emm.preprocessing.base_name_preprocessor.AbstractPreprocessor(preprocess_pipeline='preprocess_merge_abbr', input_col='name', output_col='preprocessed', spark_session=None)
Bases: Module
Base class of Name Preprocessor
- Parameters:
preprocess_pipeline (Any)
input_col (str)
output_col (str)
spark_session (Optional[Any])
- create_func_dict()
- Return type:
dict[str, Any]
emm.preprocessing.functions module
- emm.preprocessing.functions.create_func_dict(use_spark=True)
- Parameters:
use_spark (bool)
- Return type:
dict[str, Union[Callable[[Any], Any], Callable[[str], str]]]
- emm.preprocessing.functions.replace_none(name)
- Parameters:
name (str | None)
- Return type:
str
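The return type of `create_func_dict` is a dictionary from strings to callables. One plausible way to use such a dictionary is as a registry of named preprocessing steps that are applied in order. The sketch below illustrates that pattern only; the step names, the functions, and `run_pipeline` are hypothetical and do not come from emm.

```python
from typing import Callable

# Hypothetical registry of named string-cleaning steps (illustration only).
FUNC_DICT: dict[str, Callable[[str], str]] = {
    "strip": str.strip,
    "lower": str.lower,
    "collapse_spaces": lambda s: " ".join(s.split()),
}


def run_pipeline(name: str, steps: list[str]) -> str:
    """Apply the named steps to `name`, in the order given."""
    for step in steps:
        name = FUNC_DICT[step](name)
    return name


print(run_pipeline("  ING   Bank ", ["strip", "lower", "collapse_spaces"]))
# prints: ing bank
```

Keeping the steps in a dictionary lets a pipeline be described as a plain list of names, which is easy to store in configuration.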
emm.preprocessing.pandas_functions module
- emm.preprocessing.pandas_functions.lower(x)
- Parameters:
x (Series)
- Return type:
Series
- emm.preprocessing.pandas_functions.regex_replace(pat, repl, simple=False)
- Parameters:
pat (str)
repl (str)
simple (bool)
- Return type:
Callable[[Series], Series]
- emm.preprocessing.pandas_functions.run_custom_function(fn)
- Return type:
Callable[[Series], Series]
- emm.preprocessing.pandas_functions.trim(x)
- Parameters:
x (Series)
- Return type:
Series
- emm.preprocessing.pandas_functions.trim_lower(x)
- Parameters:
x (Series)
- Return type:
Series
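The signatures above show two shapes: functions that map a Series to a Series directly (`lower`, `trim`, `trim_lower`) and factories that return such a function (`regex_replace`, `run_custom_function`). The sketch below shows how functions of these shapes could be written with pandas string methods. The `_sketch` names are hypothetical, and the meaning of `simple` (treated here as "literal, non-regex replacement") is an assumption, not something the signatures confirm.

```python
from typing import Callable

import pandas as pd


def trim_lower_sketch(x: pd.Series) -> pd.Series:
    """Strip surrounding whitespace, then lower-case every value."""
    return x.str.strip().str.lower()


def regex_replace_sketch(
    pat: str, repl: str, simple: bool = False
) -> Callable[[pd.Series], pd.Series]:
    """Return a Series -> Series function that replaces `pat` with `repl`."""
    def _apply(x: pd.Series) -> pd.Series:
        # Assumption: simple=True means `pat` is a literal string, not a regex.
        return x.str.replace(pat, repl, regex=not simple)

    return _apply


names = pd.Series(["  ING Bank B.V. ", "Petroleo   Brasileiro"])
collapse_spaces = regex_replace_sketch(r"\s+", " ")
cleaned = collapse_spaces(trim_lower_sketch(names))
print(cleaned.tolist())
# prints: ['ing bank b.v.', 'petroleo brasileiro']
```

Returning a closure from the factory means each configured replacement has the same Series-to-Series shape as the plain functions, so the steps can be chained freely.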
emm.preprocessing.pandas_preprocessor module
- class emm.preprocessing.pandas_preprocessor.PandasPreprocessor(preprocess_pipeline='preprocess_merge_abbr', input_col='name', output_col='preprocessed', spark_session=None)
Bases: TransformerMixin, AbstractPreprocessor
Pandas implementation of Name Preprocessor
- Parameters:
preprocess_pipeline (Any)
input_col (str)
output_col (str)
spark_session (Optional[Any])
- create_func_dict()
- Return type:
Mapping[str, Callable]
- fit(*args, **kwargs)
Dummy function; this class does not require fitting.
- Args:
args: ignored.
kwargs: ignored.
- Returns:
self
- Parameters:
args (Any)
kwargs (Any)
- Return type:
TransformerMixin
- fit_transform(X, y=None, **extra_params)
Performs the preprocessing transform() on input names.
Performs string cleaning: converts to lower case, removes punctuation and white space, and converts legal entity forms to standard abbreviations.
Note that this class does not require fitting, so no fitting is performed.
- Args:
X: dataframe containing input names.
y: ignored.
extra_params: extra parameters are passed on to the transform() function.
- Returns:
dataframe with preprocessed names
- Parameters:
X (DataFrame)
y (Optional[Series])
extra_params (Any)
- Return type:
DataFrame
- transform(dataset, y=None)
Applies preprocessing functions to the input names in a dataframe.
Performs string cleaning: converts to lower case, removes punctuation and white space, and converts legal entity forms to standard abbreviations.
- Args:
dataset: dataframe containing input names.
y: ignored.
- Returns:
dataframe with preprocessed names
- Parameters:
dataset (DataFrame)
- Return type:
DataFrame
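The fit / transform / fit_transform contract described above (fit is a no-op that returns self, transform reads `input_col` and returns a dataframe with the cleaned names) can be sketched with a small stand-in class. `StatelessPreprocessorSketch` is hypothetical and is not the emm class. Its cleaning steps are a simplified assumption, and it omits the conversion of legal entity forms. Writing the result to `output_col` on a copy of the input is also an assumption.

```python
import pandas as pd


class StatelessPreprocessorSketch:
    """Hypothetical stand-in showing a transformer that needs no fitting."""

    def __init__(self, input_col: str = "name", output_col: str = "preprocessed"):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, *args, **kwargs):
        # Nothing to learn from the data: arguments are ignored.
        return self

    def transform(self, dataset: pd.DataFrame, y=None) -> pd.DataFrame:
        out = dataset.copy()  # assumption: leave the caller's dataframe untouched
        out[self.output_col] = (
            out[self.input_col]
            .str.lower()                              # to lower case
            .str.replace(r"[^\w\s]", "", regex=True)  # drop punctuation
            .str.split()                              # split on any whitespace
            .str.join(" ")                            # rejoin with single spaces
        )
        return out

    def fit_transform(self, X: pd.DataFrame, y=None, **extra_params) -> pd.DataFrame:
        # Fitting is a no-op, so this is just transform() on X.
        return self.fit(X, y).transform(X, **extra_params)


df = pd.DataFrame({"name": ["ING Bank, B.V.", "  Petroleo  Brasileiro "]})
result = StatelessPreprocessorSketch().fit_transform(df)
print(result["preprocessed"].tolist())
# prints: ['ing bank bv', 'petroleo brasileiro']
```

Because fit does nothing, calling transform directly on a fresh instance gives the same result as fit_transform.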
emm.preprocessing.spark_functions module
emm.preprocessing.spark_preprocessor module
Module contents
- class emm.preprocessing.PandasPreprocessor(preprocess_pipeline='preprocess_merge_abbr', input_col='name', output_col='preprocessed', spark_session=None)
Bases: TransformerMixin, AbstractPreprocessor
Pandas implementation of Name Preprocessor
This is the same class as emm.preprocessing.pandas_preprocessor.PandasPreprocessor, exposed at package level. Its parameters and methods (create_func_dict, fit, fit_transform, transform) are documented under the emm.preprocessing.pandas_preprocessor module.