emm.features package

Submodules

emm.features.base_feature_extractor module

class emm.features.base_feature_extractor.BaseFeatureExtractor

Bases: Module

emm.features.features_extra module

emm.features.features_extra.calc_extra_features(df, features)

Compute features for provided column

Args:

df: the input dataframe
features: a list of strings indicating column names (for exact matches), a tuple with column name and function

Returns:

Feature dataframe

Parameters:
  • df (DataFrame)

  • features (list[str | tuple[str, Callable]])

Return type:

DataFrame
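The `features` argument mixes two kinds of entries: plain column names and (column, function) tuples. The following is a minimal sketch of how such a specification could be interpreted, using plain dictionaries instead of a DataFrame. The helper name `sketch_extra_features` and the `gt_` prefix for the ground-truth column are assumptions made for illustration; they are not taken from the emm API.

```python
from typing import Callable, Union

# A feature spec is either a column name or a (column name, function) tuple.
FeatureSpec = Union[str, tuple[str, Callable]]


def sketch_extra_features(rows: list[dict], features: list[FeatureSpec]) -> list[dict]:
    """Hypothetical helper: returns one feature dict per input row."""
    out = []
    for row in rows:
        feats = {}
        for spec in features:
            if isinstance(spec, str):
                # Plain column name: exact match between the two sides.
                col, func = spec, lambda a, b: int(a == b)
            else:
                # (column, function): apply the user-supplied comparison.
                col, func = spec
            # Assumption: the ground-truth value lives in "gt_<column>".
            feats[col] = func(row[col], row[f"gt_{col}"])
        out.append(feats)
    return out


rows = [{"country": "NL", "gt_country": "NL", "size": 10, "gt_size": 7}]
result = sketch_extra_features(rows, ["country", ("size", lambda a, b: abs(a - b))])
```

Here `result` holds one dictionary per candidate pair, with an exact-match flag for `country` and the custom distance for `size`.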

emm.features.features_lef module

emm.features.features_lef.calc_lef_features(df, name1='preprocessed', name2='gt_preprocessed', business_type=False, detailed_match=False)

Determine legal entity form-based features of both names using cleanco

Args:

df: candidates dataframe.
name1: column of name1, default is “preprocessed”.
name2: column of name2, default is “gt_preprocessed”.
business_type: if True, determine match of general international business type (from LEF).
detailed_match: if True, store both legal entity forms (and possibly business types).
n_jobs: desired number of parallel jobs. default is 1.

Returns:

dataframe with match of legal entity forms.

Parameters:
  • df (DataFrame)

  • name1 (str)

  • name2 (str)

  • business_type (bool)

  • detailed_match (bool)

Return type:

DataFrame

emm.features.features_lef.custom_basename_and_lef(name, terms=[(4, ['gmbh', '&', 'co', 'kg']), (4, ['mbh', '&', 'co', 'kg']), (3, ['c', 'por', 'a']), (3, ['s', 'de', 'rl']), (3, ['s', 'de', 'rl']), (3, ['s', 'en', 'c']), (3, ['sm', 'pte', 'ltd']), (3, ['sp', 'z', 'oo']), (3, ['spol', 's', 'ro']), (3, ['spolka', 'z', 'oo']), (3, ['suc', 'de', 'descendants']), (3, ['vea\\x99', 'obch', 'spol']), (2, ['&', 'co']), (2, ['&', 'co']), (2, ['&', 'company']), (2, ['a', 'spol']), (2, ['akc', 'spol']), (2, ['and', 'company']), (2, ['as', 'oy']), (2, ['kom', 'spol']), (2, ['plc', 'ltd']), (2, ['pte', 'ltd']), (2, ['pte', 'ltd']), (2, ['pte', 'ltd']), (2, ['pty', 'ltd']), (2, ['pty', 'ltd']), (2, ['pvt', 'ltd']), (2, ['sce', 'i']), (2, ['sdn', 'bhd']), (2, ['sdn', 'bhd']), (2, ['sp', 'zoo']), (1, ['3ao']), (1, ['3at']), (1, ['a/s']), (1, ['aat']), (1, ['ab']), (1, ['ad']), (1, ['ad']), (1, ['adsitz']), (1, ['ae']), (1, ['ae']), (1, ['ag']), (1, ['ag']), (1, ['aj']), (1, ['amba']), (1, ['amba']), (1, ['ans']), (1, ['aps']), (1, ['as']), (1, ['as']), (1, ['asa']), (1, ['asoy']), (1, ['at']), (1, ['ay']), (1, ['ba']), (1, ['bhd']), (1, ['bhd']), (1, ['bl']), (1, ['bm']), (1, ['bm']), (1, ['bt']), (1, ['bv']), (1, ['bvba']), (1, ['ca']), (1, ['cic']), (1, ['cio']), (1, ['co']), (1, ['co']), (1, ['commv']), (1, ['company']), (1, ['coop']), (1, ['corp']), (1, ['corp']), (1, ['corporation']), (1, ['cpt']), (1, ['crl']), (1, ['cv']), (1, ['cvba']), (1, ['cvoa']), (1, ['cxa']), (1, ['da']), (1, ['dat']), (1, ['dd']), (1, ['dno']), (1, ['doo']), (1, ['dooel']), (1, ['dooel']), (1, ['ead']), (1, ['ec']), (1, ['ec']), (1, ['ee']), (1, ['ee']), (1, ['eg']), (1, ['ehf']), (1, ['ei']), (1, ['eirl']), (1, ['eirl']), (1, ['ent']), (1, ['ep']), (1, ['epe']), (1, ['epe']), (1, ['esv']), (1, ['et']), (1, ['etat']), (1, ['eu']), (1, ['eurl']), (1, ['ev']), (1, ['ev']), (1, ['fa']), (1, ['fcp']), (1, ['fie']), (1, ['fkf']), (1, ['fmba']), (1, ['fmba']), (1, ['fop']), (1, ['g/s']), (1, ['gbr']), (1, 
['gesbr']), (1, ['gie']), (1, ['gmbh']), (1, ['gmbh']), (1, ['gmbh']), (1, ['gp']), (1, ['gte']), (1, ['hb']), (1, ['hf']), (1, ['hf']), (1, ['i/s']), (1, ['i/s']), (1, ['ij']), (1, ['ik']), (1, ['iks']), (1, ['inc']), (1, ['inc']), (1, ['incorporated']), (1, ['jtd']), (1, ['k/s']), (1, ['kb']), (1, ['kd']), (1, ['kd']), (1, ['kda']), (1, ['kda']), (1, ['kf']), (1, ['kft']), (1, ['kg']), (1, ['kgaa']), (1, ['kht']), (1, ['kkt']), (1, ['koop']), (1, ['ks']), (1, ['ks']), (1, ['kt']), (1, ['kv']), (1, ['kv']), (1, ['ky']), (1, ['lda']), (1, ['limited']), (1, ['llc']), (1, ['llc']), (1, ['lllp']), (1, ['lllp']), (1, ['llp']), (1, ['llp']), (1, ['lp']), (1, ['lp']), (1, ['ltd']), (1, ['ltd']), (1, ['ltd']), (1, ['ltda']), (1, ['ltda']), (1, ['mb']), (1, ['mchj']), (1, ['mepe']), (1, ['mepe']), (1, ['nl']), (1, ['nuf']), (1, ['nv']), (1, ['nv']), (1, ['nyrt']), (1, ['nyrt']), (1, ['oaj']), (1, ['oao']), (1, ['obrt']), (1, ['od']), (1, ['oe']), (1, ['oe']), (1, ['og']), (1, ['ohf']), (1, ['ohg']), (1, ['ok']), (1, ['ong']), (1, ['ood']), (1, ['ooo']), (1, ['ovee']), (1, ['ovee']), (1, ['oy']), (1, ['oyj']), (1, ['p/s']), (1, ['partg']), (1, ['pc']), (1, ['peec']), (1, ['plc']), (1, ['plc']), (1, ['plc']), (1, ['pllc']), (1, ['pllc']), (1, ['pp']), (1, ['pp']), (1, ['private']), (1, ['ps']), (1, ['pse']), (1, ['psu']), (1, ['pt']), (1, ['pte']), (1, ['qk']), (1, ['qmj']), (1, ['rhf']), (1, ['rt']), (1, ['sa']), (1, ['sa']), (1, ['saa']), (1, ['sab']), (1, ['sad']), (1, ['sae']), (1, ['sagl']), (1, ['sal']), (1, ['saoc']), (1, ['saog']), (1, ['sapa']), (1, ['sapi']), (1, ['sarl']), (1, ['sarl']), (1, ['sas']), (1, ['sas']), (1, ['sasu']), (1, ['sc']), (1, ['sca']), (1, ['sca']), (1, ['sccl']), (1, ['scoop']), (1, ['scop']), (1, ['scpa']), (1, ['scpa']), (1, ['scra']), (1, ['scra']), (1, ['scrl']), (1, ['scs']), (1, ['scs']), (1, ['sd']), (1, ['se']), (1, ['secs']), (1, ['sem']), (1, ['sep']), (1, ['ses']), (1, ['sf']), (1, ['sf']), (1, ['sgps']), (1, ['sgr']), (1, 
['sgr']), (1, ['sgr']), (1, ['sha']), (1, ['shpk']), (1, ['sia']), (1, ['sicav']), (1, ['ska']), (1, ['sl']), (1, ['sll']), (1, ['slne']), (1, ['smba']), (1, ['smba']), (1, ['snc']), (1, ['snc']), (1, ['soccol']), (1, ['sogepa']), (1, ['sp']), (1, ['sp']), (1, ['spa']), (1, ['spj']), (1, ['spk']), (1, ['spp']), (1, ['sprl']), (1, ['srl']), (1, ['srl']), (1, ['srl']), (1, ['sro']), (1, ['ss']), (1, ['stg']), (1, ['t:mi']), (1, ['tapui']), (1, ['tdv']), (1, ['teo']), (1, ['tmi']), (1, ['tov']), (1, ['uab']), (1, ['ud']), (1, ['uk']), (1, ['ultd']), (1, ['ultd']), (1, ['unlimited']), (1, ['unltd']), (1, ['vat']), (1, ['vof']), (1, ['vof']), (1, ['vos']), (1, ['vzw']), (1, ['xk']), (1, ['xt']), (1, ['xxk']), (1, ['yoaj']), (1, ['zao']), (1, ['zat']), (1, ['zrt']), (1, ['оао']), (1, ['ооо']), (1, ['пао'])], suffix=True, prefix=False, middle=False, return_lef=False)

Return cleaned base version of the business name and legal entity form

Same as cleanco.clean.custom_basename(), but also return legal entity form(s).

Args:

name: business name to clean
terms: legal entity forms to search for.
suffix: remove legal entity forms from suffix of name. default is True.
prefix: remove legal entity forms from prefix of name. default is False.
middle: remove legal entity forms from middle of name. default is False.
return_lef: default is False.

Returns:

basename and list with list with legal entity forms

Parameters:
  • name (str)

  • suffix (bool)

  • prefix (bool)

  • middle (bool)

  • return_lef (bool)
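The `terms` default is a list of `(n_tokens, tokens)` pairs ordered from longest to shortest. A minimal sketch of suffix-only stripping over such a list is given here; it is a simplified illustration under stated assumptions, not cleanco's or emm's implementation. The helper name `sketch_basename_and_lef`, the tokenisation (lower-casing and dropping dots), and the rule that at least one token must remain as the basename are all assumptions.

```python
def sketch_basename_and_lef(name, terms):
    """Strip legal entity forms from the end of a name (suffix mode only)."""
    # Assumption: normalise by lower-casing and removing dots ("B.V." -> "bv").
    tokens = name.lower().replace(".", "").replace(",", " ").split()
    found = []
    changed = True
    while changed:
        changed = False
        # Terms are assumed to be ordered longest-first, so multi-token
        # forms such as "gmbh & co kg" win over their single-token parts.
        for n_tokens, term in terms:
            # Keep at least one token so the basename never becomes empty.
            if len(tokens) > n_tokens and tokens[-n_tokens:] == term:
                found.insert(0, " ".join(term))
                tokens = tokens[:-n_tokens]
                changed = True
                break
    return " ".join(tokens), found


TERMS = [
    (4, ["gmbh", "&", "co", "kg"]),
    (2, ["pte", "ltd"]),
    (1, ["bv"]),
    (1, ["gmbh"]),
    (1, ["ltd"]),
]
base1, lefs1 = sketch_basename_and_lef("Acme Holding B.V.", TERMS)
base2, lefs2 = sketch_basename_and_lef("Muller GmbH & Co. KG", TERMS)
```

The longest-first ordering matters: with the shorter terms first, "GmbH & Co. KG" would be stripped piecewise and the combined legal entity form would be lost.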

emm.features.features_lef.extract_lef(name, terms=[(4, ['gmbh', '&', 'co', 'kg']), (4, ['mbh', '&', 'co', 'kg']), (3, ['c', 'por', 'a']), (3, ['s', 'de', 'rl']), (3, ['s', 'de', 'rl']), (3, ['s', 'en', 'c']), (3, ['sm', 'pte', 'ltd']), (3, ['sp', 'z', 'oo']), (3, ['spol', 's', 'ro']), (3, ['spolka', 'z', 'oo']), (3, ['suc', 'de', 'descendants']), (3, ['vea\x99', 'obch', 'spol']), (2, ['&', 'co']), (2, ['&', 'co']), (2, ['&', 'company']), (2, ['a', 'spol']), (2, ['akc', 'spol']), (2, ['and', 'company']), (2, ['as', 'oy']), (2, ['kom', 'spol']), (2, ['plc', 'ltd']), (2, ['pte', 'ltd']), (2, ['pte', 'ltd']), (2, ['pte', 'ltd']), (2, ['pty', 'ltd']), (2, ['pty', 'ltd']), (2, ['pvt', 'ltd']), (2, ['sce', 'i']), (2, ['sdn', 'bhd']), (2, ['sdn', 'bhd']), (2, ['sp', 'zoo']), (1, ['3ao']), (1, ['3at']), (1, ['a/s']), (1, ['aat']), (1, ['ab']), (1, ['ad']), (1, ['ad']), (1, ['adsitz']), (1, ['ae']), (1, ['ae']), (1, ['ag']), (1, ['ag']), (1, ['aj']), (1, ['amba']), (1, ['amba']), (1, ['ans']), (1, ['aps']), (1, ['as']), (1, ['as']), (1, ['asa']), (1, ['asoy']), (1, ['at']), (1, ['ay']), (1, ['ba']), (1, ['bhd']), (1, ['bhd']), (1, ['bl']), (1, ['bm']), (1, ['bm']), (1, ['bt']), (1, ['bv']), (1, ['bvba']), (1, ['ca']), (1, ['cic']), (1, ['cio']), (1, ['co']), (1, ['co']), (1, ['commv']), (1, ['company']), (1, ['coop']), (1, ['corp']), (1, ['corp']), (1, ['corporation']), (1, ['cpt']), (1, ['crl']), (1, ['cv']), (1, ['cvba']), (1, ['cvoa']), (1, ['cxa']), (1, ['da']), (1, ['dat']), (1, ['dd']), (1, ['dno']), (1, ['doo']), (1, ['dooel']), (1, ['dooel']), (1, ['ead']), (1, ['ec']), (1, ['ec']), (1, ['ee']), (1, ['ee']), (1, ['eg']), (1, ['ehf']), (1, ['ei']), (1, ['eirl']), (1, ['eirl']), (1, ['ent']), (1, ['ep']), (1, ['epe']), (1, ['epe']), (1, ['esv']), (1, ['et']), (1, ['etat']), (1, ['eu']), (1, ['eurl']), (1, ['ev']), (1, ['ev']), (1, ['fa']), (1, ['fcp']), (1, ['fie']), (1, ['fkf']), (1, ['fmba']), (1, ['fmba']), (1, ['fop']), (1, ['g/s']), (1, ['gbr']), (1, ['gesbr']), 
(1, ['gie']), (1, ['gmbh']), (1, ['gmbh']), (1, ['gmbh']), (1, ['gp']), (1, ['gte']), (1, ['hb']), (1, ['hf']), (1, ['hf']), (1, ['i/s']), (1, ['i/s']), (1, ['ij']), (1, ['ik']), (1, ['iks']), (1, ['inc']), (1, ['inc']), (1, ['incorporated']), (1, ['jtd']), (1, ['k/s']), (1, ['kb']), (1, ['kd']), (1, ['kd']), (1, ['kda']), (1, ['kda']), (1, ['kf']), (1, ['kft']), (1, ['kg']), (1, ['kgaa']), (1, ['kht']), (1, ['kkt']), (1, ['koop']), (1, ['ks']), (1, ['ks']), (1, ['kt']), (1, ['kv']), (1, ['kv']), (1, ['ky']), (1, ['lda']), (1, ['limited']), (1, ['llc']), (1, ['llc']), (1, ['lllp']), (1, ['lllp']), (1, ['llp']), (1, ['llp']), (1, ['lp']), (1, ['lp']), (1, ['ltd']), (1, ['ltd']), (1, ['ltd']), (1, ['ltda']), (1, ['ltda']), (1, ['mb']), (1, ['mchj']), (1, ['mepe']), (1, ['mepe']), (1, ['nl']), (1, ['nuf']), (1, ['nv']), (1, ['nv']), (1, ['nyrt']), (1, ['nyrt']), (1, ['oaj']), (1, ['oao']), (1, ['obrt']), (1, ['od']), (1, ['oe']), (1, ['oe']), (1, ['og']), (1, ['ohf']), (1, ['ohg']), (1, ['ok']), (1, ['ong']), (1, ['ood']), (1, ['ooo']), (1, ['ovee']), (1, ['ovee']), (1, ['oy']), (1, ['oyj']), (1, ['p/s']), (1, ['partg']), (1, ['pc']), (1, ['peec']), (1, ['plc']), (1, ['plc']), (1, ['plc']), (1, ['pllc']), (1, ['pllc']), (1, ['pp']), (1, ['pp']), (1, ['private']), (1, ['ps']), (1, ['pse']), (1, ['psu']), (1, ['pt']), (1, ['pte']), (1, ['qk']), (1, ['qmj']), (1, ['rhf']), (1, ['rt']), (1, ['sa']), (1, ['sa']), (1, ['saa']), (1, ['sab']), (1, ['sad']), (1, ['sae']), (1, ['sagl']), (1, ['sal']), (1, ['saoc']), (1, ['saog']), (1, ['sapa']), (1, ['sapi']), (1, ['sarl']), (1, ['sarl']), (1, ['sas']), (1, ['sas']), (1, ['sasu']), (1, ['sc']), (1, ['sca']), (1, ['sca']), (1, ['sccl']), (1, ['scoop']), (1, ['scop']), (1, ['scpa']), (1, ['scpa']), (1, ['scra']), (1, ['scra']), (1, ['scrl']), (1, ['scs']), (1, ['scs']), (1, ['sd']), (1, ['se']), (1, ['secs']), (1, ['sem']), (1, ['sep']), (1, ['ses']), (1, ['sf']), (1, ['sf']), (1, ['sgps']), (1, ['sgr']), (1, ['sgr']), (1, 
['sgr']), (1, ['sha']), (1, ['shpk']), (1, ['sia']), (1, ['sicav']), (1, ['ska']), (1, ['sl']), (1, ['sll']), (1, ['slne']), (1, ['smba']), (1, ['smba']), (1, ['snc']), (1, ['snc']), (1, ['soccol']), (1, ['sogepa']), (1, ['sp']), (1, ['sp']), (1, ['spa']), (1, ['spj']), (1, ['spk']), (1, ['spp']), (1, ['sprl']), (1, ['srl']), (1, ['srl']), (1, ['srl']), (1, ['sro']), (1, ['ss']), (1, ['stg']), (1, ['t:mi']), (1, ['tapui']), (1, ['tdv']), (1, ['teo']), (1, ['tmi']), (1, ['tov']), (1, ['uab']), (1, ['ud']), (1, ['uk']), (1, ['ultd']), (1, ['ultd']), (1, ['unlimited']), (1, ['unltd']), (1, ['vat']), (1, ['vof']), (1, ['vof']), (1, ['vos']), (1, ['vzw']), (1, ['xk']), (1, ['xt']), (1, ['xxk']), (1, ['yoaj']), (1, ['zao']), (1, ['zat']), (1, ['zrt']), (1, ['оао']), (1, ['ооо']), (1, ['пао'])], suffix=True, prefix=False, middle=False, return_lef=True)

Extract legal entity form(s) from business name.

Same as custom_basename_and_lef(), but returns no basename.

Args:

name: business name to clean
terms: legal entity forms to search for.
suffix: remove legal entity forms from suffix of name. default is True.
prefix: remove legal entity forms from prefix of name. default is False.
middle: remove legal entity forms from middle of name. default is False.
return_lef: default is True.

Returns:

joined string of legal entity forms found

emm.features.features_lef.get_business_type(joined_lef, types_by_lef={'': ['no_lef'], ' i/s': ['General Partnership'], '& co': ['Corporation'], '3at': ['Limited Liability Company'], 'a spol': ['General Partnership'], 'a/s': ['Limited Liability Company'], 'aat': ['Limited Liability Company'], 'ab': ['Limited Liability Company'], 'ad': ['Limited Liability Company'], 'ae': ['Limited Liability Company'], 'ag': ['Corporation'], 'aj': ['Joint Stock / Unlimited'], 'akc spol': ['Joint Stock / Unlimited'], 'ans': ['General Partnership'], 'aps': ['Limited Liability Company'], 'as': ['Joint Stock / Unlimited', 'Limited', 'Limited Liability Company'], 'asa': ['Limited Liability Company'], 'ay': ['General Partnership'], 'bhd': ['Limited'], 'bt': ['General Partnership'], 'bv': ['Limited'], 'bvba': ['Limited'], 'co': ['Corporation'], 'commv': ['Limited Partnership'], 'company': ['Corporation'], 'corp': ['Corporation'], 'corporation': ['Corporation'], 'cpt': ['Limited Liability Company'], 'cv': ['Limited Partnership'], 'da': ['General Partnership'], 'dat': ['Limited Liability Company'], 'dd': ['Limited Liability Company'], 'dno': ['General Partnership'], 'doo': ['Limited'], 'dooel': ['Limited'], 'ec': ['Sole Proprietorship'], 'ee': ['Limited Partnership'], 'ehf': ['Limited'], 'esv': ['Joint Venture'], 'et': ['Sole Proprietorship'], 'eu': ['Sole Proprietorship'], 'eurl': ['Limited Liability Company'], 'ev': ['Sole Proprietorship'], 'fie': ['Sole Proprietorship'], 'fop': ['Sole Proprietorship'], 'gie': ['Joint Venture'], 'gmbh': ['Limited'], 'gmbh & co kg': ['Limited Partnership'], 'gte': ['Non-Profit'], 'hb': ['General Partnership'], 'hf': ['Limited Liability Company'], 'ij': ['Sole Proprietorship'], 'inc': ['Corporation'], 'incorporated': ['Corporation'], 'jtd': ['General Partnership'], 'k/s': ['Limited Partnership'], 'kb': ['Limited Partnership'], 'kd': ['Limited Partnership'], 'kda': ['Limited Partnership'], 'kft': ['Limited'], 'kg': ['Limited Partnership'], 'kgaa': ['General 
Partnership'], 'kht': ['Limited'], 'ks': ['Limited Partnership'], 'kt': ['Limited Partnership'], 'kv': ['Joint Venture'], 'ky': ['Limited Partnership'], 'lda': ['Limited'], 'limited': ['Limited'], 'llc': ['Limited Liability Company'], 'lllp': ['Limited Liability Limited Partnership'], 'llp': ['Limited Liability Partnership'], 'lp': ['Limited Partnership'], 'ltd': ['Limited'], 'ltda': ['General Partnership', 'Limited'], 'mb': ['General Partnership'], 'mchj': ['Limited Liability Company'], 'nl': ['No Liability'], 'nuf': ['Corporation'], 'nv': ['Corporation', 'Limited Liability Company'], 'nyrt': ['Limited Liability Company'], 'oaj': ['Joint Stock / Unlimited'], 'oao': ['Corporation'], 'obrt': ['Sole Proprietorship'], 'od': ['General Partnership'], 'oe': ['General Partnership'], 'og': ['General Partnership'], 'ood': ['Limited'], 'ooo': ['Limited Liability Company'], 'oy': ['Limited'], 'oyj': ['Limited Liability Company'], 'p/s': ['Limited Liability Company'], 'pc': ['Professional Corporation'], 'plc': ['Limited Liability Company'], 'pllc': ['Limited Liability Company', 'Professional Limited Liability Company'], 'pp': ['Limited'], 'private': ['Private Company'], 'pt': ['General Partnership'], 'pte': ['Private Company'], 'pty ltd': ['Limited'], 'qk': ['Joint Venture'], 'rt': ['Limited'], 's de rl': ['Limited'], 's en c': ['Limited Partnership'], 'sa': ['Corporation', 'Limited Liability Company'], 'sae': ['Limited Liability Company'], 'sal': ['Joint Stock / Unlimited'], 'saoc': ['Joint Stock / Unlimited'], 'saog': ['Joint Stock / Unlimited'], 'sapa': ['General Partnership'], 'sarl': ['Limited', 'Limited Liability Company'], 'sas': ['Limited Partnership'], 'sasu': ['Limited Liability Company'], 'sca': ['Limited Liability Partnership'], 'scpa': ['Limited Partnership'], 'scra': ['Limited Partnership'], 'scs': ['Limited', 'Limited Liability Partnership', 'Limited Partnership'], 'sd': ['General Partnership'], 'sdn bhd': ['Limited'], 'secs': ['Limited Partnership'], 'ses': 
['Non-Profit'], 'sf': ['Corporation', 'General Partnership'], 'sha': ['Limited Liability Company'], 'sicav': ['Mutual Fund'], 'ska': ['Limited Partnership'], 'sl': ['Limited'], 'slne': ['Limited'], 'smba': ['Limited Liability Company'], 'snc': ['General Partnership', 'Professional Corporation'], 'soccol': ['General Partnership'], 'sp': ['Sole Proprietorship'], 'sp z oo': ['Limited'], 'sp zoo': ['Limited'], 'spa': ['Corporation'], 'spj': ['General Partnership'], 'spk': ['Limited Partnership'], 'spol s ro': ['Limited Liability Company'], 'spolka z oo': ['Limited'], 'spp': ['Limited Liability Partnership'], 'sprl': ['Limited'], 'srl': ['Limited', 'Limited Liability Company'], 'sro': ['Limited Liability Company'], 'ss': ['General Partnership'], 'stg': ['General Partnership'], 't:mi': ['Sole Proprietorship'], 'tapui': ['Limited'], 'teo': ['Limited'], 'tmi': ['Sole Proprietorship'], 'tov': ['Limited'], 'uab': ['Limited'], 'ultd': ['Joint Stock / Unlimited'], 'unlimited': ['Joint Stock / Unlimited'], 'unltd': ['Joint Stock / Unlimited'], 'vat': ['Limited Liability Company'], 'vea\\x99 obch spol': ['General Partnership'], 'vof': ['General Partnership', 'Professional Corporation'], 'vos': ['General Partnership'], 'vzw': ['Non-Profit'], 'xk': ['Private Company'], 'xt': ['Sole Proprietorship'], 'yoaj': ['Joint Stock / Unlimited'], 'zat': ['Limited Liability Company'], 'zrt': ['Limited']})

Derive general business type from legal entity form

Args:

joined_lef: joined string of legal entity forms, from extract_lef().
types_by_lef: default is TYPES_BY_LEF classification from cleanco.

Returns:

joined string of general business types found.

Parameters:

joined_lef (str)

emm.features.features_lef.make_combi(joined1, joined2)

Make combined string utility function

Parameters:
  • joined1 (str)

  • joined2 (str)

Do two legal entity forms match

Args:

term1: legal entity form 1
term2: legal entity form 2

Returns:

matching string.

Parameters:
  • term1 (str)

  • term2 (str)

emm.features.features_lef.types_by_lef_dict(lefs_by_type={'Corporation': ['company', 'incorporated', 'corporation', 'corp.', 'corp', 'inc', '& co.', '& co', 'inc.', 's.p.a.', 'n.v.', 'a.g.', 'ag', 'nuf', 's.a.', 's.f.', 'oao', 'co.', 'co'], 'General Partnership': ['soc.col.', 'stg', 'd.n.o.', 'ltda.', 'v.o.s.', 'a spol.', 'veÅ\x99. obch. spol.', 'kgaa', 'o.e.', 's.f.', 's.n.c.', 's.a.p.a.', 'j.t.d.', 'v.o.f.', 'sp.j.', 'og', 'sd', ' i/s', 'ay', 'snc', 'oe', 'bt.', 's.s.', 'mb', 'ans', 'da', 'o.d.', 'hb', 'pt'], 'Joint Stock / Unlimited': ['unltd', 'ultd', 'sal', 'unlimited', 'saog', 'saoc', 'aj', 'yoaj', 'oaj', 'akc. spol.', 'a.s.'], 'Joint Venture': ['esv', 'gie', 'kv.', 'qk'], 'Limited': ['pty. ltd.', 'pty ltd', 'ltd', 'l.t.d.', 'bvba', 'd.o.o.', 'ltda', 'gmbh', 'g.m.b.h', 'kft.', 'kht.', 'zrt.', 'ehf.', 's.a.r.l.', 'd.o.o.e.l.', 's. de r.l.', 'b.v.', 'tapui', 'sp. z.o.o.', 'sp. z o.o.', 'spółka z o.o.', 's.r.l.', 's.l.', 's.l.n.e.', 'ood', 'oy', 'rt.', 'teo', 'uab', 'scs', 'sprl', 'limited', 'bhd.', 'sdn. bhd.', 'sdn bhd', 'as', 'lda.', 'tov', 'pp'], 'Limited Liability Company': ['pllc', 'llc', 'l.l.c.', 'plc.', 'plc', 'hf.', 'oyj', 'a.e.', 'nyrt.', 'p.l.c.', 'sh.a.', 's.a.', 's.r.l.', 'srl.', 'srl', 'aat', '3at', 'd.d.', 's.r.o.', 'spol. s r.o.', 's.m.b.a.', 'smba', 'sarl', 'nv', 'sa', 'aps', 'a/s', 'p/s', 'sae', 'sasu', 'eurl', 'ae', 'cpt', 'as', 'ab', 'asa', 'ooo', 'dat', 'vat', 'zat', 'mchj', 'a.d.'], 'Limited Liability Limited Partnership': ['lllp', 'l.l.l.p.'], 'Limited Liability Partnership': ['llp', 'l.l.p.', 'sp.p.', 's.c.a.', 's.c.s.'], 'Limited Partnership': ['gmbh & co. kg', 'lp', 'l.p.', 's.c.s.', 's.c.p.a', 'comm.v', 'k.d.', 'k.d.a.', 's. en c.', 'e.e.', 's.a.s.', 's. 
en c.', 'c.v.', 's.k.a.', 'sp.k.', 's.cra.', 'ky', 'scs', 'kg', 'kd', 'k/s', 'ee', 'secs', 'kda', 'ks', 'kb', 'kt'], 'Mutual Fund': ['sicav'], 'No Liability': ['nl'], 'Non-Profit': ['vzw', 'ses.', 'gte.'], 'Private Company': ['private', 'pte', 'xk'], 'Professional Corporation': ['p.c.', 'vof', 'snc'], 'Professional Limited Liability Company': ['pllc', 'p.l.l.c.'], 'Sole Proprietorship': ['e.u.', 's.p.', 't:mi', 'tmi', 'e.v.', 'e.c.', 'et', 'obrt', 'fie', 'ij', 'fop', 'xt']})

Business types by legal entity form

Invert cleanco’s dictionary terms_by_type.

Args:

lefs_by_type: cleanco’s terms_by_type dictionary.

Returns:

types_by_lef dict
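Inverting a type-to-terms dictionary can be sketched as follows. The helper name `sketch_invert` is hypothetical, and the normalisation step (dropping dots from each term) is an assumption inferred from the keys of the default `types_by_lef` shown for get_business_type(), where for example 'as' maps to three business types.

```python
from collections import defaultdict


def sketch_invert(lefs_by_type):
    """Map each legal entity form to the sorted list of its business types."""
    types_by_lef = defaultdict(set)
    for business_type, lefs in lefs_by_type.items():
        for lef in lefs:
            # Assumption: keys are normalised by dropping dots,
            # so 'a.s.' and 'as' end up under the same key.
            types_by_lef[lef.replace(".", "")].add(business_type)
    return {lef: sorted(types) for lef, types in types_by_lef.items()}


lefs_by_type = {
    "Limited": ["ltd", "l.t.d.", "as"],
    "Limited Liability Company": ["llc", "as"],
    "Joint Stock / Unlimited": ["a.s."],
}
types_by_lef = sketch_invert(lefs_by_type)
```

Because one normalised term can belong to several business types, each value is a list rather than a single string.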

emm.features.features_name module

emm.features.features_name.abbr_match(str_with_abbr, str_with_open_form)

If str_with_abbr contains both upper- and lower-case characters, the original method is used; otherwise an approximate check is applied: all short words (2 to 5 characters long) are tested as abbreviations.

Parameters:
  • str_with_abbr (str)

  • str_with_open_form (str)

Return type:

bool

emm.features.features_name.abs_len_diff(name1, name2)

Difference (in characters) in lengths of names

Parameters:
  • name1 (str)

  • name2 (str)

Return type:

int

emm.features.features_name.calc_name_features(df, funcs, name1='preprocessed', name2='gt_preprocessed')
Parameters:
  • df (DataFrame)

  • funcs (dict[Callable, str])

  • name1 (str)

  • name2 (str)

Return type:

DataFrame

emm.features.features_name.extract_abbr_merged_initials(abbr, name)

Extract a possible open form of the given abbreviation, if one exists.

Example: (SK, Fenerbahce Spor Klubu) => Spor Klubu

Parameters:
  • abbr (str)

  • name (str)

Return type:

Optional[Match]
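One way to look for the open form of an initials-style abbreviation is to build a regular expression that expects one word per letter. This is a sketch under assumptions, not the emm implementation: the helper name `sketch_extract_initials` and the case-insensitive matching are illustrative choices.

```python
import re
from typing import Optional


def sketch_extract_initials(abbr: str, name: str) -> Optional[re.Match]:
    """Search `name` for consecutive words whose initials spell `abbr`."""
    # One word per letter of the abbreviation: "SK" -> r"\bS\w*\s+K\w*"
    pattern = r"\b" + r"\s+".join(re.escape(ch) + r"\w*" for ch in abbr)
    return re.search(pattern, name, flags=re.IGNORECASE)


match = sketch_extract_initials("SK", "Fenerbahce Spor Klubu")
open_form = match.group(0) if match else None
no_match = sketch_extract_initials("FC", "Fenerbahce Spor Klubu")
```

Returning the match object rather than a string lets the caller read both the matched text and its position in the name.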

emm.features.features_name.extract_abbr_merged_word_pieces(abbr, name)

Extract a possible open form of the given abbreviation, if one exists.

Example: (PetroBras, Petroleo Brasileiro B.V.) => Petroleo Brasileiro

Parameters:
  • abbr (str)

  • name (str)

Return type:

Optional[Match]
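For abbreviations built from word pieces, a comparable sketch splits the abbreviation at its capital letters and requires each piece to start a consecutive word. The helper name `sketch_extract_word_pieces` and the splitting rule (a capital followed by lower-case letters) are assumptions for illustration only.

```python
import re
from typing import Optional


def sketch_extract_word_pieces(abbr: str, name: str) -> Optional[re.Match]:
    """Search `name` for consecutive words that start with the pieces of `abbr`."""
    # Split at capitals: "PetroBras" -> ["Petro", "Bras"]
    pieces = re.findall(r"[A-Z][a-z]+", abbr)
    if len(pieces) < 2:
        # Not a merged-word-pieces abbreviation under this sketch's rule.
        return None
    pattern = r"\b" + r"\s+".join(re.escape(p) + r"\w*" for p in pieces)
    return re.search(pattern, name, flags=re.IGNORECASE)


match = sketch_extract_word_pieces("PetroBras", "Petroleo Brasileiro B.V.")
open_form = match.group(0) if match else None
```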

emm.features.features_name.find_abbr_merged_initials(name)

Finds abbreviations with merged initials.

Examples: FC Barcelona => FC, ING BANK B.V. => BV

Parameters:

name (str)

Return type:

list[str]

emm.features.features_name.find_abbr_merged_word_pieces(name)

Finds abbreviations with merged word pieces.

Example: PetroBras

Parameters:

name (str)

Return type:

list[str]

emm.features.features_name.len_ratio(name1, name2)

Calculates the ratio of the names’ lengths (1 means equal lengths; 0.5 means one name is twice as long as the other)

Parameters:
  • name1 (str)

  • name2 (str)

Return type:

float

emm.features.features_name.name_cut(name1, name2)

Tests if one name is a prefix of the other

Parameters:
  • name1 (str)

  • name2 (str)

Return type:

bool
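The length-based helpers described in this module (abs_len_diff, len_ratio and name_cut) can be approximated in a few lines. The `sketch_` names are hypothetical, and the handling of two empty names in the ratio is an assumption, since the documentation does not specify it.

```python
def sketch_abs_len_diff(name1: str, name2: str) -> int:
    """Absolute difference in length, in characters."""
    return abs(len(name1) - len(name2))


def sketch_len_ratio(name1: str, name2: str) -> float:
    """Shorter length divided by longer length, so the result lies in [0, 1]."""
    longest = max(len(name1), len(name2))
    # Assumption: two empty names are treated as having equal length.
    return min(len(name1), len(name2)) / longest if longest else 1.0


def sketch_name_cut(name1: str, name2: str) -> bool:
    """True if either name is a prefix of the other."""
    return name1.startswith(name2) or name2.startswith(name1)


ratio = sketch_len_ratio("abcd", "ab")
diff = sketch_abs_len_diff("abcd", "ab")
is_cut = sketch_name_cut("acme", "acme holding")
```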

emm.features.features_name.original_abbr_match(str_with_abbr, str_with_open_form)

Checks if the second string has an open form of an abbreviation from the first string

Parameters:
  • str_with_abbr (str)

  • str_with_open_form (str)

Return type:

bool

emm.features.features_rank module

emm.features.features_rank.calc_diff_features(df, funcs, score_columns, uid_col='uid', fillna=-1)
Parameters:
  • df (DataFrame)

  • funcs (dict[str, Callable])

  • score_columns (list[str] | None)

  • uid_col (str)

  • fillna (int)

Return type:

DataFrame

emm.features.features_rank.calc_rank_features(df, funcs, score_columns, uid_col='uid', fillna=-1)
Parameters:
  • df (DataFrame)

  • funcs (dict[str, Callable])

  • score_columns (list[str] | None)

  • uid_col (str)

  • fillna (int)

Return type:

DataFrame

emm.features.features_rank.diff_to_next(df, c, uid_col)
emm.features.features_rank.diff_to_prev(df, c, uid_col)
emm.features.features_rank.dist_to_max(df, c, uid_col)
emm.features.features_rank.dist_to_min(df, c, uid_col)
emm.features.features_rank.feat_ptp(df, c, uid_col)
emm.features.features_rank.group_by_uid(df, c, uid_col)
emm.features.features_rank.ptp(a)

Numpy ptp that is safe if input contains no elements or only NaN.

Range of values (maximum - minimum) along an axis.

Parameters:

a (array)
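A peak-to-peak calculation that tolerates empty input and NaN values could look like the sketch below. It uses only the standard library rather than NumPy, and returning NaN when no usable values remain is an assumption; the value emm returns in that case is not stated here.

```python
import math


def sketch_safe_ptp(values):
    """Maximum minus minimum, ignoring NaN entries."""
    usable = [v for v in values if not math.isnan(v)]
    if not usable:
        # Assumption: signal "no range available" with NaN.
        return math.nan
    return max(usable) - min(usable)


spread = sketch_safe_ptp([1.0, float("nan"), 4.0])
empty = sketch_safe_ptp([])
```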

emm.features.features_rank.rank(df, c, uid_col)
emm.features.features_rank.top2_dist(df, c, uid_col)
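The rank helpers all take a score column `c` and a grouping column `uid_col`, which suggests that each feature is computed within the group of candidates that share a uid. The sketch below illustrates that idea for a rank and a distance-to-maximum feature on plain dictionaries. The helper name, the descending rank order and the tie handling are assumptions; the exact definitions in emm may differ.

```python
from collections import defaultdict


def sketch_rank_features(rows, score_col, uid_col="uid"):
    """Per-uid rank (1 = highest score) and distance to the group's best score."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[row[uid_col]].append(i)

    feats = [None] * len(rows)
    for indices in groups.values():
        # Assumption: rank in descending score order, ties broken by row order.
        ordered = sorted(indices, key=lambda i: rows[i][score_col], reverse=True)
        best = rows[ordered[0]][score_col]
        for rank, i in enumerate(ordered, start=1):
            feats[i] = {"rank": rank, "dist_to_max": best - rows[i][score_col]}
    return feats


rows = [
    {"uid": 1, "score": 0.5},
    {"uid": 1, "score": 1.0},
    {"uid": 1, "score": 0.25},
    {"uid": 2, "score": 0.75},
]
feats = sketch_rank_features(rows, "score")
```

Features of this kind describe a candidate relative to its competitors for the same name, which a raw score alone cannot express.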

emm.features.features_vocabulary module

class emm.features.features_vocabulary.Vocabulary(very_common_words, common_words)

Bases: object

Parameters:
  • very_common_words (set[str])

  • common_words (set[str])

common_words: set[str]
very_common_words: set[str]
emm.features.features_vocabulary.compute_vocabulary_features(df, col1, col2, very_common_words=None, common_words=None)

Features on tokens

Args:

df: input DataFrame
col1: name to compare
col2: other name to compare
common_words: pre-computed common words
very_common_words: pre-computed very common words

Returns:

DataFrame with features, e.g.:
- hits: words in both names
- misses: words that are in just one name (on either side)

Parameters:
  • df (DataFrame)

  • col1 (str)

  • col2 (str)

  • very_common_words (Optional[set[str]])

  • common_words (Optional[set[str]])

Return type:

DataFrame
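The hits and misses described above map naturally onto set operations on the tokens of the two names. The following sketch counts them for a single pair; the helper name, the whitespace tokenisation and the choice to return counts are assumptions, and the split by word commonness is left out.

```python
def sketch_token_features(name1: str, name2: str) -> dict:
    """Count shared and unshared tokens for one pair of names."""
    tokens1, tokens2 = set(name1.split()), set(name2.split())
    hits = tokens1 & tokens2    # words in both names
    misses = tokens1 ^ tokens2  # words in exactly one of the names
    return {"hits": len(hits), "misses": len(misses)}


pair_features = sketch_token_features("acme holding bv", "acme bv group")
```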

emm.features.features_vocabulary.create_vocabulary(df, columns, very_common_words_min_df=0.01, common_words_min_df=0.0001)

Get two sets of ‘common’ and ‘very common’ words

Args:

df: data to obtain the vocabulary from
columns: columns to compute the vocabulary from
very_common_words_min_df: minimal document frequency to be considered ‘very common’
common_words_min_df: minimum document frequency to be considered ‘common’

Examples:
>>> vocabulary = create_vocabulary(
...    df,
...    columns=["preprocessed", "gt_preprocessed"],
...    very_common_words_min_df=0.05,
...    common_words_min_df=0.005,
... )
>>> print(vocabulary.very_common_words)
{"hello", "world"}
>>> print(vocabulary.common_words)
{"the", "a", "in"}

Returns:

Vocabulary with common and very common words

Parameters:
  • df (DataFrame)

  • columns (list[str])

  • very_common_words_min_df (float | int)

  • common_words_min_df (float | int)

Return type:

Vocabulary
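Document frequency here means the fraction of names in which a word occurs at least once. A minimal sketch of the thresholding, on a plain list of names, is shown below. The helper name is hypothetical, only fractional thresholds are handled, and making the two sets disjoint is an assumption rather than documented behaviour.

```python
from collections import Counter


def sketch_create_vocabulary(names, very_common_min_df, common_min_df):
    """Return (very_common_words, common_words) based on document frequency."""
    if not names:
        return set(), set()
    doc_freq = Counter()
    for name in names:
        # Count each word once per name, however often it repeats inside it.
        doc_freq.update(set(name.split()))
    n = len(names)
    very_common = {w for w, c in doc_freq.items() if c / n >= very_common_min_df}
    # Assumption: "common" excludes words already classed as very common.
    common = {w for w, c in doc_freq.items() if c / n >= common_min_df} - very_common
    return very_common, common


names = ["acme bv", "apex bv", "acme holding bv", "zenith group"]
very_common, common = sketch_create_vocabulary(names, 0.75, 0.5)
```

In this toy input "bv" occurs in three of four names and "acme" in two, so they land in the very common and common sets respectively.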

emm.features.pandas_feature_extractor module

class emm.features.pandas_feature_extractor.PandasFeatureExtractor(name1_col='preprocessed', name2_col='gt_preprocessed', uid_col='uid', gt_uid_col='gt_uid', score_columns=None, extra_features=None, vocabulary=None, without_rank_features=False, with_legal_entity_forms_match=False, fillna_value=None, drop_features=None)

Bases: TransformerMixin, BaseFeatureExtractor

Sklearn based transformer for calculating numeric features for candidate pairs (used by supervised model)

Args:

name1_col: column with name from names to match
name2_col: column with name from ground truth
uid_col: column with unique ID of row from names to match
score_columns: list of columns with raw scores from indexers
extra_features: list of columns used for extra features (e.g. country)
without_rank_features: if False then score rank based features will be calculated (can be overridden in transform function)
with_legal_entity_forms_match: if True, add match of legal entity forms feature
fillna_value: fill nans with float value. default is None.
drop_features: list of features to drop at end of calculation, before passing to the supervised model. default is None.

Parameters:
  • name1_col (str)

  • name2_col (str)

  • uid_col (str)

  • gt_uid_col (str)

  • score_columns (Optional[list[str]])

  • extra_features (Optional[list[str | tuple[str, Callable]]])

  • vocabulary (Optional[Vocabulary])

  • without_rank_features (bool)

  • with_legal_entity_forms_match (bool)

  • fillna_value (Optional[float])

  • drop_features (Optional[list[str]])

fit(X, y=None)
Parameters:
  • X (DataFrame)

  • y (Optional[Series])

Return type:

PandasFeatureExtractor

transform(X)
Parameters:

X (DataFrame)

Return type:

DataFrame

Module contents

class emm.features.PandasFeatureExtractor(name1_col='preprocessed', name2_col='gt_preprocessed', uid_col='uid', gt_uid_col='gt_uid', score_columns=None, extra_features=None, vocabulary=None, without_rank_features=False, with_legal_entity_forms_match=False, fillna_value=None, drop_features=None)

Bases: TransformerMixin, BaseFeatureExtractor

Sklearn based transformer for calculating numeric features for candidate pairs (used by supervised model)

Args:

name1_col: column with name from names to match
name2_col: column with name from ground truth
uid_col: column with unique ID of row from names to match
score_columns: list of columns with raw scores from indexers
extra_features: list of columns used for extra features (e.g. country)
without_rank_features: if False then score rank based features will be calculated (can be overridden in transform function)
with_legal_entity_forms_match: if True, add match of legal entity forms feature
fillna_value: fill nans with float value. default is None.
drop_features: list of features to drop at end of calculation, before passing to the supervised model. default is None.

Parameters:
  • name1_col (str)

  • name2_col (str)

  • uid_col (str)

  • gt_uid_col (str)

  • score_columns (Optional[list[str]])

  • extra_features (Optional[list[str | tuple[str, Callable]]])

  • vocabulary (Optional[Vocabulary])

  • without_rank_features (bool)

  • with_legal_entity_forms_match (bool)

  • fillna_value (Optional[float])

  • drop_features (Optional[list[str]])

fit(X, y=None)
Parameters:
  • X (DataFrame)

  • y (Optional[Series])

Return type:

PandasFeatureExtractor

transform(X)
Parameters:

X (DataFrame)

Return type:

DataFrame