Guide to Fitting, Predicting and Creating Functions for Machine Learning Models

In the world of data science, there are a variety of machine learning models at a data scientist’s disposal. Often, we are presented with a dataset and are required to identify which ML algorithm suits our data best. While there are AutoML tools such as TPOT that can automate this process for us, as a student it is useful to fit each model manually, in order to fully understand how each model works.

The aim of this blog is to walk through how to instantiate, fit, predict with, and preview the results of a machine learning model. Once I have walked through this process, I will show how to create a function that performs these same steps, in order to help new students build a quicker, more efficient workflow. Let’s get started!

1. Context

The dataset we will explore looks at 2015 Flight Delays and Cancellations. The goal of the analysis is to see which factors help predict whether a flight will be delayed or not, particularly whether a specific airport, airline, month or day of the week influences the result. To start, I examined the distribution in our dataset of cancelled and non-cancelled flights:
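One quick way to check a class distribution like this is a normalized value count. The sketch below uses a tiny made-up DataFrame in place of the real flights data, and the column name `CANCELLED` is an assumption for illustration, so treat it as a pattern rather than the exact code behind the chart.

```python
import pandas as pd

# Tiny made-up stand-in for the flights table; the real data is not reproduced here.
# The column name "CANCELLED" (1 = cancelled, 0 = not cancelled) is an assumption.
flights = pd.DataFrame({"CANCELLED": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]})

# Share of rows in each class; a heavily skewed split signals class imbalance.
class_share = flights["CANCELLED"].value_counts(normalize=True)
print(class_share)
```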

2. Fitting the Model

I then prepared the data by setting our X and y values, followed by a train-test split. It is clear from the image above that there is a strong class imbalance in the dataset, so I applied SMOTE (Synthetic Minority Over-sampling Technique) to the training data in order to mitigate it. See the SMOTE documentation for more details.

3. Vanilla Model Example

You need to use these resampled Xs and ys in order to fit the model, but any predictions will be made on the original X_train and X_test values. See below for a basic, vanilla example using the AdaBoost algorithm.
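A minimal sketch of this fit-on-resampled, predict-on-original pattern follows. To keep it runnable with scikit-learn and NumPy alone, the data is synthetic and plain random duplication of minority rows stands in for the SMOTE output; that substitution and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the flights features.
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Stand-in for the SMOTE output: duplicate random minority rows until balanced.
rng = np.random.default_rng(42)
minority_idx = np.flatnonzero(y_train == 1)
n_extra = int((y_train == 0).sum()) - minority_idx.size
extra = rng.choice(minority_idx, size=n_extra, replace=True)
X_train_res = np.vstack([X_train, X_train[extra]])
y_train_res = np.concatenate([y_train, y_train[extra]])

# Vanilla model: default hyperparameters apart from the random seed.
ada = AdaBoostClassifier(random_state=42)
ada.fit(X_train_res, y_train_res)  # fit on the resampled data

# Predict on the original, un-resampled features.
train_preds = ada.predict(X_train)
test_preds = ada.predict(X_test)

print("train accuracy: ", accuracy_score(y_train, train_preds))
print("test accuracy:  ", accuracy_score(y_test, test_preds))
print("test precision: ", precision_score(y_test, test_preds))
```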

The model performs quite well overall, with strong accuracy and precision scores on the test set. Of course, tuning the model’s hyperparameters could improve it further. That is outside the scope of this blog post, but it is important to note that so far we have only trained the vanilla model.

4. Creating a Function

The process for fitting many machine learning algorithms is the same: fit on the SMOTE values, predict on the original X_train and X_test, and then display the accuracy, precision, recall and f1 scores. There are other results that can be shown as well, such as the classification report and a confusion matrix. If we were to run several ML models, we would need to rewrite the same code over and over again.

A much faster and more reader-friendly approach is to create a function that performs all of these steps, and then call that function on each model that you instantiate.

Take a look at the function below:

The function takes in several inputs: the X and y SMOTE values, the original X_train, y_train, X_test, and y_test, followed by the model classifier, and then the name of the model.

Next, the model is fit using the SMOTE Xs and Ys. The predictions are then calculated using the original X_train and X_test values. Then, a dictionary called “result” is created, which contains the model type and name, then the accuracy, precision, recall and f1 scores for both the train and test sets.

Finally, the function prints out several items, so that when the function is called, specific things are shown. First, we print the classification report for the train data, followed by the accuracy, precision, recall and f1 scores that were calculated in the dictionary within the function. Lastly, the function creates subplots using Matplotlib to display the confusion matrices for the train and test data.
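Here is a minimal sketch of what a function matching this description could look like. It is a reconstruction from the steps above, not the original code: the name `evaluate_model`, the `plot` flag, the dictionary keys and the return value are all assumptions, and it expects binary 0/1 labels.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    f1_score,
    precision_score,
    recall_score,
)


def evaluate_model(X_smote, y_smote, X_train, y_train, X_test, y_test,
                   classifier, name, plot=True):
    """Fit on resampled data, score on the original splits, report results.

    Hypothetical helper: the name, the `plot` flag and the returned
    dictionary are assumptions made for this sketch (binary labels only).
    """
    # Fit on the SMOTE-resampled training data.
    classifier.fit(X_smote, y_smote)

    # Predict on the original (un-resampled) train and test features.
    train_preds = classifier.predict(X_train)
    test_preds = classifier.predict(X_test)

    # Collect the model type, its name, and the four scores for both splits.
    result = {"model": type(classifier).__name__, "name": name}
    metrics = {
        "accuracy": accuracy_score,
        "precision": precision_score,
        "recall": recall_score,
        "f1": f1_score,
    }
    splits = (("train", y_train, train_preds), ("test", y_test, test_preds))
    for split, y_true, y_pred in splits:
        for metric_name, metric in metrics.items():
            result[f"{split}_{metric_name}"] = metric(y_true, y_pred)

    # Print the training classification report, then the stored scores.
    print(f"Classification report (train) for {name}")
    print(classification_report(y_train, train_preds))
    for key, value in result.items():
        if key not in ("model", "name"):
            print(f"{key}: {value:.3f}")

    if plot:
        # Imported here so the scoring part works without Matplotlib installed.
        import matplotlib.pyplot as plt
        from sklearn.metrics import ConfusionMatrixDisplay

        # Side-by-side confusion matrices for the train and test data.
        fig, axes = plt.subplots(1, 2, figsize=(10, 4))
        for ax, (split, y_true, y_pred) in zip(axes, splits):
            ConfusionMatrixDisplay.from_predictions(
                y_true, y_pred, ax=ax, colorbar=False
            )
            ax.set_title(f"{name}: {split}")
        plt.tight_layout()
        plt.show()

    return result
```

Returning the dictionary, rather than only printing it, makes it easy to collect the results of several models into one table later.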

While this code looks like a lot, it is quite easy to understand once it is used in practice. Below, we call the function on the AdaBoost classifier that we instantiated before.

And that’s it! One simple function was written, and we now quickly have all the results for the model in one place. If you would like to run different machine learning models, simply instantiate the classifier, then run the model through your function, and voilà!
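To illustrate the reuse, here is a small sketch that runs several classifiers through one helper in a loop. So that the snippet runs on its own, a trimmed, hypothetical stand-in called `quick_scores` takes the place of the full function, the data is synthetic, and the un-resampled training split is passed where the SMOTE arrays would normally go.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split


def quick_scores(X_fit, y_fit, X_test, y_test, classifier, name):
    """Trimmed, hypothetical stand-in for a full evaluation function."""
    classifier.fit(X_fit, y_fit)
    preds = classifier.predict(X_test)
    return {
        "name": name,
        "test_accuracy": accuracy_score(y_test, preds),
        "test_f1": f1_score(y_test, preds),
    }


# Synthetic imbalanced data standing in for the flights features.
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Instantiate each classifier once, then push them all through the same helper.
models = {
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
results = [
    quick_scores(X_train, y_train, X_test, y_test, clf, name)
    for name, clf in models.items()
]

for row in results:
    print(f"{row['name']}: accuracy={row['test_accuracy']:.3f}, f1={row['test_f1']:.3f}")
```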

Hopefully this post gave you a solid understanding of how building functions can simplify your modeling process by eliminating the need to rewrite the same code for every model. It is worth considering a function whenever you build machine learning models, as it can make your work a lot easier to follow.

Thank you for reading and good luck with building your machine learning models!

Data Science Student at Flatiron School
