Persistence¶
Here’s how to save and load entity matching models.
Store and load pandas-based model
p.save("pandas_entity_matching_model.pkl")
from emm import PandasEntityMatching
p2 = PandasEntityMatching.load("pandas_entity_matching_model.pkl")
Apply the pandas model as usual:
p2.transform(names_pandas)
Store and reopen a spark-based model in the same way, but in a directory
s.save("spark_entity_matching_model")
from emm import SparkEntityMatching
s2 = SparkEntityMatching.load("spark_entity_matching_model")
For both pandas and spark, by default we use the joblib
library with compression
to store and load all non-spark objects.
The load and dump functions used can be changed to different functions:
io = emm.helper.io.IOFunc()
io.writer = pickle.dump
io.reader = pickle.load
Note that reader
and writer
are global attributes, so they get picked up by all
classes that use IOFunc
, and only need to be set once.
For example, one will need to change these functions for writing and reading to s3
.