Building a Python Package for your ML model
Move your model from Jupyter Notebooks to a Python Package
Data scientists love Jupyter (and other) notebooks because they are phenomenal for building and testing a machine learning model. They offer a quick, hassle-free way to work and let anyone get started in data science with little friction. Notebooks are easy to use, easy to understand, and widely used in the data science community for building and sharing models.
However, when it comes to deploying your model in a production system or building a microservices architecture, a Jupyter (or any other) notebook is of little use. Many people have written about why you shouldn’t keep working only in Jupyter notebooks (a quick web search will turn up plenty), but this presentation stands out. I strongly recommend viewing it, whatever you are working on; it is well worth your time.
The point here is that, apart from using notebooks, a data scientist should also know how to write production-ready code rather than relying on someone else. They should know how to build a model in a notebook (Jupyter, Zeppelin, or anything else) and convert it to code that adheres to software engineering best practices: organized, modular, maintainable, and deployable.
This article follows the journey of a model from a Jupyter notebook to the same model in a Python package. All the code is in this GitHub repo, step by step. If anything is unclear, download the code for that particular step and study it.
The next step is publishing this package on the Python Package Index (PyPI), converting it to a Flask app, and deploying it on the web as an API. This will be covered in the next article.
Data: The data is the freely available Housing Prices Prediction dataset on Kaggle. Its notebooks section covers many models built by other people.
Disclaimer: There may be inaccuracies in the models below; something could have been done more accurately, or a different model could have been used. This article does not aim to show how to build a high-accuracy model (the code is just a reference); the aim is to show how to write deployable code for a model using Pipelines. Also, a lot of code involving feature selection, etc. has been removed to keep this article short.
Step 1: Build a base model in Jupyter Notebook
This is a very basic beginner’s model: we read in data, do some missing value imputation on numerical and categorical variables, encode some features, transform some features, and then run a Lasso model on the transformed data. In the end, we use the model to make predictions on the test data. Nothing out of the ordinary, just basic stuff.
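To make the flow concrete, here is a minimal sketch of such a notebook-style model. The tiny dataframe, the column choices, the category codes, and the Lasso alpha are all assumptions made for illustration; they are not the values used in the actual notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Made-up stand-in for the housing data (illustrative values only).
data = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 60.0, 75.0, np.nan],
    "GrLivArea": [1710, 1262, 1786, 1717, 2198, 1362],
    "MSZoning": ["RL", "RM", None, "RL", "RL", None],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000],
})
train, test = data.iloc[:4].copy(), data.iloc[4:].copy()

# Values learned from the training rows only.
lot_median = train["LotFrontage"].median()
zoning_codes = {"Missing": 0, "RL": 1, "RM": 2}  # assumed encoding

for df in (train, test):
    # 1. Impute missing numerical and categorical values.
    df["LotFrontage"] = df["LotFrontage"].fillna(lot_median)
    df["MSZoning"] = df["MSZoning"].fillna("Missing")
    # 2. Encode the categorical feature as integers.
    df["MSZoning"] = df["MSZoning"].map(zoning_codes)
    # 3. Log-transform a skewed numerical feature.
    df["GrLivArea"] = np.log(df["GrLivArea"])

# 4. Fit a Lasso model on the log of the target.
features = ["LotFrontage", "GrLivArea", "MSZoning"]
model = Lasso(alpha=0.005)
model.fit(train[features], np.log(train["SalePrice"]))

# 5. Predict on the held-out rows and map back to the price scale.
predictions = np.exp(model.predict(test[features]))
```

Everything lives in one top-to-bottom script here, which is exactly what the following steps take apart.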
Step 2: Rearrange some code to make it a bit more readable
This is not strictly necessary, but it helps to organize your code. The biggest change here is to keep all the config variables and the variable lists for different tasks in one place. Of course, I ran a lot of other steps to arrive at these lists of variables, but you can work those out from any finished model.
What the cell with all the config variables looks like:
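A minimal sketch of such a cell is given here. The variable names and the column lists are assumptions chosen for illustration, not the lists used in the actual model.

```python
# Hypothetical config cell: file names, target and variable lists in one place.
TRAINING_DATA_FILE = "train.csv"
TESTING_DATA_FILE = "test.csv"
TARGET = "SalePrice"

# Variables that need a particular treatment (illustrative choices).
NUMERICAL_VARS_WITH_NA = ["LotFrontage"]
CATEGORICAL_VARS_WITH_NA = ["MasVnrType", "BsmtQual"]
NUMERICALS_LOG_VARS = ["LotFrontage", "1stFlrSF", "GrLivArea"]
CATEGORICAL_VARS = ["MSZoning", "Neighborhood", "MasVnrType", "BsmtQual"]

# The final feature list is derived from the lists above so it stays in sync.
FEATURES = sorted(set(NUMERICALS_LOG_VARS) | set(CATEGORICAL_VARS))
```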
This helps later: when you do more R&D on this model and find that you need to add a couple more variables, drop a feature, or apply a different transform to a variable, you only have to change this config part of your package and nothing else.
Step 3: Convert the code to a Python script (.py)
This is a simple step. You can copy-paste the Jupyter code into a code editor (PyCharm, Sublime, or VS Code), or download the notebook as a .py script using the File → Download as → Python (.py) option. I prefer a plain code editor (not PyCharm or Anaconda/Spyder) since I want to run the entire code as a whole in the terminal rather than in chunks (a good habit, except when you need to run the code chunk by chunk).
This is what the code looks like:
Step 4: Break the code into different files (modules)
How do you do it? Start by moving all the config variables into a separate config.py script, and import that script into your main code. This is just a demo of how to do this for one part of the code (the config variables); we will do the same for other parts too.
Step 5: Convert each step into a function that takes a dataframe and returns it after processing
For example, loading a dataset using pd.read_csv can be moved to a separate file named ‘data_management.py’, which will have a function that takes a filename as an argument, reads the data, and returns it. This can be imported into the main code using ‘from data_management import load_dataset’. Later, more data management functions (save results, load pipeline, etc.) can be stored in the same file.
The main code imports the load_dataset function from the data_management.py script, along with the entire preprocessors.py file and all its functions. The train and test datasets are read by picking up the filenames from the config.py file and passing them to the load_dataset function, which reads the data and returns the dataframes.
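As a rough sketch, the loader can be as small as the function below. Only the name load_dataset comes from the text above; the sample data is made up, and an in-memory buffer stands in for a CSV file so the snippet runs on its own.

```python
import io

import pandas as pd


def load_dataset(file_name):
    """Read a CSV (a path or a file-like object) and return a dataframe."""
    return pd.read_csv(file_name)


# In the main code this would be: from data_management import load_dataset
# A made-up in-memory CSV stands in for the file named in config.py.
sample_csv = io.StringIO("LotFrontage,SalePrice\n65,208500\n80,223500\n")
train = load_dataset(sample_csv)
```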
Similarly, every other piece of code can be turned into a function that takes some arguments, performs some processing, and returns a dataframe. This is what the individual functions look like in the preprocessors.py file:
These functions are brought into the main code with the import preprocessors as pp statement and then called one by one. I know this is a bit cumbersome when everything could have been done in one function, but Step 6 will explain why a separate function for every task is needed.
The code directory now has 4 files: main_code.py, config.py, preprocessors.py and data_management.py.
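A minimal sketch of this function-per-task style follows. The function names, the mode-based imputation, and the sample data are assumptions for illustration; in the real layout the functions would sit in preprocessors.py and be called as pp.numerical_imputer(...) and so on.

```python
import numpy as np
import pandas as pd


def numerical_imputer(df, variables):
    """Fill missing numerical values with each column's mode."""
    df = df.copy()
    for var in variables:
        df[var] = df[var].fillna(df[var].mode()[0])
    return df


def categorical_imputer(df, variables):
    """Replace missing categorical values with the label 'Missing'."""
    df = df.copy()
    for var in variables:
        df[var] = df[var].fillna("Missing")
    return df


def log_transform(df, variables):
    """Apply a natural-log transform to the given columns."""
    df = df.copy()
    for var in variables:
        df[var] = np.log(df[var])
    return df


# Main code: each function takes a dataframe and returns the processed copy.
data = pd.DataFrame({
    "LotFrontage": [65.0, None, 65.0, 80.0],
    "MSZoning": ["RL", None, "RM", "RL"],
})
data = numerical_imputer(data, ["LotFrontage"])
data = categorical_imputer(data, ["MSZoning"])
data = log_transform(data, ["LotFrontage"])
```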
Step 6: Convert functional code to Pipelines
What are pipelines? Pipelines are one of the most useful components of the sklearn library: they let you break your analysis into individual steps, with the output of one step feeding the input of the next (hence the name). Not every function can be used directly in a pipeline; it has to be written in a specific format, deriving from some built-in classes in sklearn. There are a lot of built-in transformers and estimators that can be put in a pipeline directly, and you can also write your own custom ones.
For more details on Pipelines and how to build a custom pipeline, read this blog here. If you are new to pipelines, take some time to go through it and understand the examples. The same functions in this code have been used in the blog on pipelines, so you can relate to them directly.
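For a feel of the idea before any custom code, here is a small sketch that chains only built-in sklearn components. The toy arrays, step names, and parameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy data with one missing value (illustrative only).
X = np.array([[1.0, 200.0], [2.0, np.nan], [3.0, 180.0], [4.0, 260.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

# Each step's output becomes the next step's input.
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", MinMaxScaler()),
    ("model", Lasso(alpha=0.01)),
])

pipe.fit(X, y)                 # fits every step in order
predictions = pipe.predict(X)  # transforms with each step, then predicts
```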
Your config.py and data_management.py code will not change.
preprocessors.py: In place of functions, you now have classes that will be used in the pipeline. An example is shown below. The commented part is the older version (a function), which has been converted to a class that derives from sklearn’s BaseEstimator and TransformerMixin classes [from sklearn.base import BaseEstimator, TransformerMixin]. The class contains the ‘fit’ and ‘transform’ methods that perform the operations on the data.
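A sketch of one such conversion, assuming a categorical imputer; the class name, the "Missing" label, and the sample dataframe are illustrative rather than taken from the repo.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

# Older function version, kept as a comment for comparison:
# def categorical_imputer(df, variables):
#     df = df.copy()
#     for var in variables:
#         df[var] = df[var].fillna("Missing")
#     return df


class CategoricalImputer(BaseEstimator, TransformerMixin):
    """Replace missing categorical values with the label 'Missing'."""

    def __init__(self, variables=None):
        # Store parameters unchanged, as sklearn expects.
        self.variables = variables

    def fit(self, X, y=None):
        # Nothing to learn for this step; fit exists so the pipeline can call it.
        return self

    def transform(self, X):
        X = X.copy()
        for var in self.variables:
            X[var] = X[var].fillna("Missing")
        return X


df = pd.DataFrame({"MSZoning": ["RL", None, "RM"]})
out = CategoricalImputer(variables=["MSZoning"]).fit_transform(df)
```

TransformerMixin supplies fit_transform for free, which is why only fit and transform are written by hand.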
After converting every function to the corresponding pipeline class, there will be an additional file pipeline.py that will contain our final pipeline. You need to import Pipeline from sklearn.pipeline to build your pipeline [from sklearn.pipeline import Pipeline]
The second-to-last step, MinMaxScaler(), is a built-in transformer imported from the sklearn.preprocessing module. This shows that you can combine your custom-built steps with built-in ones. The last step, [‘Linear Model’, Lasso(…)], is also built in; the final step of a pipeline only needs to implement fit (it does not need transform).
How is this called in the main code? main_code.py is now very short, since everything has been moved into its own compartment.
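The sketch below shows one way a pipeline of this shape could be assembled and called. The LogTransformer class, the pipeline name price_pipe, the columns, and the alpha value are assumptions; only MinMaxScaler, Lasso, and the 'Linear Model' step name come from the text.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler


class LogTransformer(BaseEstimator, TransformerMixin):
    """Custom step: natural-log transform of the given columns."""

    def __init__(self, variables=None):
        self.variables = variables

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        for var in self.variables:
            X[var] = np.log(X[var])
        return X


# pipeline.py: custom step, then built-in scaler, then the model.
price_pipe = Pipeline([
    ("log_transformer", LogTransformer(variables=["GrLivArea"])),
    ("scaler", MinMaxScaler()),
    ("Linear Model", Lasso(alpha=0.005)),
])

# main_code.py: load data, fit, predict (made-up rows for illustration).
X = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786, 1717],
    "OverallQual": [7, 6, 7, 7],
})
y = np.log(pd.Series([208500, 181500, 223500, 140000]))

price_pipe.fit(X, y)
predictions = np.exp(price_pipe.predict(X))
```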
Step 7: Clean up the code, add a few bells and whistles
Most of the heavy lifting is done, but some work is left. This is essentially compartmentalizing your code further so that you will not run into issues on bigger projects. We will create a few folders for keeping this code.
Why more folders? For example, we have a config.py file that contains all the config information related to the model. When you add, say, logging functionality (which you should if your model is getting deployed), you will need to add a logging_config.py file to your package. It’ll be good to have all the config files in one folder. Similarly, you can split your preprocessors.py file into one file that does all the imputations and encoding, and another that does all the transformations (e.g., log_transform). All of these can go in the processing folder.
Here is the updated structure:
Folders:
config : Contains config.py, or any other config file
datasets: Train and test datasets go here. Not needed if you are loading datasets from a URL, or just using a trained model (.pkl)
processing: Has data_management (load and save data and pipelines) and preprocessors (classes for pipelines)
trained_models: a place for saving the trained model in .pkl format once it is built. The predict.py code uses this model (loaded using load_model from data_management.py) to make the prediction
pipeline.py: code for making the pipeline using classes and functions in preprocessors.py
predict.py: code for making the prediction
train_pipeline.py: Main code
Some more functionality has been added here, for example saving and loading a pipeline in data_management.py.
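One common way to persist a fitted pipeline as a .pkl file is joblib, sketched here. The function names save_pipeline and load_pipeline, the file name, and the use of joblib itself are assumptions; the repo may do this differently.

```python
import os
import tempfile

import joblib
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler


def save_pipeline(pipeline_to_save, file_path):
    """Persist a fitted pipeline to disk."""
    joblib.dump(pipeline_to_save, file_path)


def load_pipeline(file_path):
    """Load a previously saved pipeline."""
    return joblib.load(file_path)


# Fit a tiny pipeline, save it, and load it back (toy data for illustration).
pipe = Pipeline([("scaler", MinMaxScaler()), ("model", Lasso(alpha=0.01))])
pipe.fit([[1.0], [2.0], [3.0]], [1.0, 2.0, 3.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "regression_model.pkl")
    save_pipeline(pipe, model_path)
    restored = load_pipeline(model_path)
```

In the folder layout above, the .pkl file would be written to trained_models rather than to a temporary directory.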
Also, train_pipeline.py has been restructured to look more like a proper Python script than the one in the previous step. We save the model in the run_training() step and load it again in the make_prediction() step [see predict.py for details].
I have added a small piece of code for testing the prediction; it will be removed when this is converted to a package, as it’s a bad idea to put print statements in a package. Also, the make_prediction function lives in the predict.py file.
See the GitHub repo for further details and the complete code.
The entire prediction model in this format and structure is just the starting point for the final output: a Python package or a Flask API.
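A condensed sketch of how the two scripts could fit together is shown here in a single block. The names run_training and make_prediction come from the text; the model path, feature list, pipeline steps, and sample data are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical location; the article's layout would use trained_models/.
MODEL_PATH = os.path.join(tempfile.gettempdir(), "regression_model_sketch.pkl")
FEATURES = ["GrLivArea", "OverallQual"]


def run_training(data):
    """train_pipeline.py: fit the pipeline and save it to disk."""
    pipe = Pipeline([("scaler", MinMaxScaler()), ("model", Lasso(alpha=0.005))])
    pipe.fit(data[FEATURES], np.log(data["SalePrice"]))
    joblib.dump(pipe, MODEL_PATH)


def make_prediction(input_data):
    """predict.py: load the saved pipeline and predict prices."""
    pipe = joblib.load(MODEL_PATH)
    return np.exp(pipe.predict(input_data[FEATURES]))


# Made-up rows standing in for the training file.
sample = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786, 1717],
    "OverallQual": [7, 6, 7, 7],
    "SalePrice": [208500, 181500, 223500, 140000],
})

if __name__ == "__main__":
    run_training(sample)
    # Temporary check of the prediction; remove before packaging.
    print(make_prediction(sample.head(2)))
```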
Step 8: Bringing it all together in a Python Package
Link to the final package folder. Be aware that the folder structure has changed from Step 7. All modules and folders are under the folder regression_model, which is also the name of the package.
The next step is to convert this code into a Python package. There is a fair amount to do to achieve this, but all of it is doable. Here is a blog link that explains how to do this step by step; follow the steps and you’ll be able to build the package.
What Next?
As mentioned earlier, this is just the beginning. This is the bare minimum a data scientist must be able to do before the next step: deploying this package on PyPI or as an API. We will cover this in the next blog post.
Also, a few previous readers complained that the code is in fragments and difficult to follow due to multiple versions and other reasons. I’ll try to walk through this entire process, from Jupyter notebook to PyPI package, in a video stream to make these steps clearer. Will post soon.