emm.preprocessing package

Submodules

emm.preprocessing.abbreviation_util module

emm.preprocessing.abbreviation_util.abbr_match(str_with_abbr, str_with_open_form)

Checks whether the second string contains the open (expanded) form of an abbreviation found in the first string.

emm.preprocessing.abbreviation_util.abbreviations_to_words(name)

Maps all spelling variants of an abbreviation to the same format (B. V. = B.V. = B.V = B V = BV).
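
The equivalence listed above can be illustrated with a small regular-expression sketch. This is not emm's implementation: the helper name merge_separated_initials and the pattern are assumptions made for illustration, and the real function may handle more cases (or different ones).

```python
import re

# Hypothetical pattern (not taken from emm): two or more single letters
# separated by dots and/or whitespace, with an optional trailing dot.
_SEPARATED_INITIALS = re.compile(
    r"\b[A-Za-z](?:\.\s*|\s+)(?:[A-Za-z](?:\.\s*|\s+))*[A-Za-z]\b\.?"
)


def merge_separated_initials(name: str) -> str:
    """Collapse forms such as 'B. V.' or 'B V' into 'BV'."""
    return _SEPARATED_INITIALS.sub(
        # Keep only the letters of each matched run of initials.
        lambda m: re.sub(r"[^A-Za-z]", "", m.group(0)),
        name,
    )


variants = ["B. V.", "B.V.", "B.V", "B V", "BV"]
normalized = [merge_separated_initials(v) for v in variants]
```

In this sketch every entry of normalized is the same string, which is the property the description asks for; a name such as "J. Smith" is left unchanged because the second token is not a single letter.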

emm.preprocessing.abbreviation_util.extract_abbr_merged_initials(abbr, name)

Extracts a possible open form of the given abbreviation, if one exists. Example: (SK, Fenerbahce Spor Klubu) => Spor Klubu
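
One way to read the example above is as a sliding-window search over the words of the name. The sketch below follows that reading; the function name open_form_for_initials is hypothetical and the matching rule (first letter of each consecutive word equals the corresponding abbreviation letter) is an assumption, not emm's documented behavior.

```python
from typing import Optional


def open_form_for_initials(abbr: str, name: str) -> Optional[str]:
    """Return the first run of consecutive words whose initials spell abbr."""
    initials = abbr.upper()
    if not initials:
        return None
    words = name.split()
    n = len(initials)
    # Slide a window of n words over the name and compare initials.
    for start in range(len(words) - n + 1):
        window = words[start:start + n]
        if all(w[0].upper() == c for w, c in zip(window, initials)):
            return " ".join(window)
    return None


match = open_form_for_initials("SK", "Fenerbahce Spor Klubu")
```

When no run of words matches, the sketch returns None rather than an empty string; the real function's return convention is not stated in this reference.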

emm.preprocessing.abbreviation_util.extract_abbr_merged_word_pieces(abbr, name)

Extracts a possible open form of the given abbreviation, if one exists. Example: (PetroBras, Petroleo Brasileiro B.V.) => Petroleo Brasileiro
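
The word-piece case can be sketched in a similar way: split the abbreviation at its capital letters and look for consecutive words that start with those pieces. The name open_form_for_word_pieces and the prefix-matching rule are assumptions for illustration only, not emm's implementation.

```python
import re
from typing import Optional


def open_form_for_word_pieces(abbr: str, name: str) -> Optional[str]:
    """Return consecutive words that start with the pieces of abbr."""
    # 'PetroBras' -> ['petro', 'bras']
    pieces = [p.lower() for p in re.findall(r"[A-Z][a-z]+", abbr)]
    if len(pieces) < 2:
        return None
    words = name.split()
    for start in range(len(words) - len(pieces) + 1):
        window = words[start:start + len(pieces)]
        if all(w.lower().startswith(p) for w, p in zip(window, pieces)):
            return " ".join(window)
    return None


match = open_form_for_word_pieces("PetroBras", "Petroleo Brasileiro B.V.")
```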

emm.preprocessing.abbreviation_util.find_abbr_merged_initials(name)

Finds abbreviations formed from merged initials. Examples: FC Barcelona => FC, ING BANK B.V. => BV

emm.preprocessing.abbreviation_util.find_abbr_merged_word_pieces(name)

Finds abbreviations formed from merged word pieces. Example: PetroBras

emm.preprocessing.abbreviation_util.legal_abbreviations_to_words(name)

Maps all legal-form abbreviations to the same format (B. V. = B.V. = B V = BV).

emm.preprocessing.abbreviation_util.preprocess(name)

emm.preprocessing.base_name_preprocessor module

This module provides several helper functions for name preprocessing.

As a user, you can use preprocess_name directly.

class emm.preprocessing.base_name_preprocessor.AbstractPreprocessor(preprocess_pipeline='preprocess_merge_abbr', input_col='name', output_col='preprocessed', spark_session=None)

Bases: Module

Base class of Name Preprocessor

Parameters:
  • preprocess_pipeline (Any)

  • input_col (str)

  • output_col (str)

  • spark_session (Optional[Any])

create_func_dict()
Return type:

dict[str, Any]

emm.preprocessing.functions module

emm.preprocessing.functions.create_func_dict(use_spark=True)
Parameters:

use_spark (bool)

Return type:

dict[str, Union[Callable[[Any], Any], Callable[[str], str]]]

emm.preprocessing.functions.replace_none(name)
Parameters:

name (str | None)

Return type:

str

emm.preprocessing.pandas_functions module

emm.preprocessing.pandas_functions.lower(x)
Parameters:

x (Series)

Return type:

Series

emm.preprocessing.pandas_functions.regex_replace(pat, repl, simple=False)
Parameters:
  • pat (str)

  • repl (str)

  • simple (bool)

Return type:

Callable[[Series], Series]
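
The return type above indicates a factory: the call produces a function that is later applied to a pandas Series. A minimal sketch of that pattern follows. The name make_regex_replacer is hypothetical, and the simple flag is left out because its semantics are not described in this reference.

```python
from typing import Callable

import pandas as pd


def make_regex_replacer(pat: str, repl: str) -> Callable[[pd.Series], pd.Series]:
    """Build a Series -> Series function that applies one regex substitution."""

    def _apply(x: pd.Series) -> pd.Series:
        # Series.str.replace with regex=True treats pat as a regular expression.
        return x.str.replace(pat, repl, regex=True)

    return _apply


names = pd.Series(["ING  Bank", "Petroleo   Brasileiro"])
collapse_spaces = make_regex_replacer(r"\s+", " ")
cleaned = collapse_spaces(names)
```

Returning a closure keeps each step of a preprocessing pipeline as a plain Series-to-Series callable, so steps can be stored in a list and applied in order.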

emm.preprocessing.pandas_functions.run_custom_function(fn)
Return type:

Callable[[Series], Series]

emm.preprocessing.pandas_functions.trim(x)
Parameters:

x (Series)

Return type:

Series

emm.preprocessing.pandas_functions.trim_lower(x)
Parameters:

x (Series)

Return type:

Series

emm.preprocessing.pandas_preprocessor module

class emm.preprocessing.pandas_preprocessor.PandasPreprocessor(preprocess_pipeline='preprocess_merge_abbr', input_col='name', output_col='preprocessed', spark_session=None)

Bases: TransformerMixin, AbstractPreprocessor

Pandas implementation of Name Preprocessor

Parameters:
  • preprocess_pipeline (Any)

  • input_col (str)

  • output_col (str)

  • spark_session (Optional[Any])

create_func_dict()
Return type:

Mapping[str, Callable]

fit(*args, **kwargs)

Dummy function; this class does not require fitting.

Args:
  • args: ignored.

  • kwargs: ignored.

Returns:

self

Parameters:
  • args (Any)

  • kwargs (Any)

Return type:

TransformerMixin

fit_transform(X, y=None, **extra_params)

Run the preprocessing transform() on the input names.

Performs string cleaning: lowercases names, removes punctuation and white space, and converts legal entity forms to standard abbreviations.

Note that this class does not require fitting, so no fitting is done.

Args:
  • X: dataframe containing input names.

  • y: ignored.

  • extra_params: extra parameters are passed on to the transform() function.

Returns:

dataframe with preprocessed names

Parameters:
  • X (DataFrame)

  • y (Optional[Series])

  • extra_params (Any)

Return type:

DataFrame

transform(dataset, y=None)

Apply the preprocessing functions to the input names in the dataframe.

Performs string cleaning: lowercases names, removes punctuation and white space, and converts legal entity forms to standard abbreviations.

Args:
  • dataset: dataframe containing input names.

  • y: ignored.

Returns:

dataframe with preprocessed names

Parameters:

dataset (DataFrame)

Return type:

DataFrame
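
The fit / transform / fit_transform contract described above (no fitting, input column in, output column added) can be sketched with a stand-in class. StatelessNamePreprocessor is hypothetical and is not emm's PandasPreprocessor: it only trims and lowercases, and copying the input dataframe is an assumption of this sketch.

```python
import pandas as pd


class StatelessNamePreprocessor:
    """Hypothetical stand-in showing the stateless transformer contract."""

    def __init__(self, input_col: str = "name", output_col: str = "preprocessed"):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, *args, **kwargs):
        # Nothing to learn; return self so calls can be chained.
        return self

    def transform(self, dataset: pd.DataFrame, y=None) -> pd.DataFrame:
        out = dataset.copy()  # assumption: leave the caller's frame untouched
        out[self.output_col] = out[self.input_col].str.strip().str.lower()
        return out

    def fit_transform(self, X: pd.DataFrame, y=None) -> pd.DataFrame:
        return self.fit(X, y).transform(X)


df = pd.DataFrame({"name": ["  ING Bank ", "PetroBras"]})
result = StatelessNamePreprocessor().fit_transform(df)
```

Because fit() does nothing, calling transform() directly gives the same result as fit_transform() in this sketch.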

emm.preprocessing.spark_functions module

emm.preprocessing.spark_preprocessor module

Module contents

class emm.preprocessing.PandasPreprocessor(preprocess_pipeline='preprocess_merge_abbr', input_col='name', output_col='preprocessed', spark_session=None)

Bases: TransformerMixin, AbstractPreprocessor

Pandas implementation of Name Preprocessor
