Oftentimes in machine learning, you are working towards solving a classification problem. For example, will someone default on their credit card payment, or pay it on time? Or, based on specific factors, will a flight be classified as delayed or non-delayed? To solve these problems, datasets need to include data for each possible outcome, otherwise known as each “class”. However, the amount of data available for each class may be uneven. If 90% of your data belongs to one class and 10% to the other, you are faced with the issue of class imbalance. If you train your machine learning models on this imbalanced data, they will likely be biased toward the majority class and misclassify examples from the minority class. Fortunately, there are several techniques that can mitigate this issue, and we will cover them right now in this blog post!
1. SMOTE technique
The SMOTE (Synthetic Minority Oversampling Technique) method works by creating synthesized samples for your dataset using a nearest-neighbors approach, specifically for the less frequent class. This is considered an “oversampling” method, since we are creating additional observations for the minority class. Once this resampled data is generated, it can be used to train any machine learning algorithm.
Here is an example of SMOTE in action: the original dataset had a 64%/36% split between the two classes, but once SMOTE was applied, the resampled data had a 50/50 split. This new resampled data can then be used to overcome the issues that come with imbalanced datasets.
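To make the nearest-neighbors idea concrete, here is a simplified, self-contained sketch of SMOTE's core step: pick a minority point, pick one of its k nearest minority neighbors, and create a new sample somewhere on the line between them. (In practice you would use a library implementation such as `SMOTE` from the imbalanced-learn package; the `smote_sample` helper and the toy 2-D points below are purely illustrative.)

```python
import numpy as np

def smote_sample(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by interpolating
    between each chosen point and one of its k nearest neighbors."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbor
    # indices of the k nearest neighbors for each minority point
    nn = np.argsort(d, axis=1)[:, :k]
    new = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        a = rng.integers(n)              # pick a random minority point
        b = nn[a, rng.integers(k)]       # pick one of its k neighbors
        gap = rng.random()               # interpolation factor in [0, 1)
        new[i] = X_min[a] + gap * (X_min[b] - X_min[a])
    return new

# toy minority class: 6 points in 2-D
X_min = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [2., 0.], [0., 2.]])
X_new = smote_sample(X_min, n_new=4, k=3, rng=0)
print(X_new.shape)  # (4, 2)
```

Because each synthetic point sits between two real minority points, the new samples stay inside the region the minority class already occupies rather than being invented from nothing.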
2. Undersampling
Another idea is to remove observations from the majority class. Using the original class distribution above as an example, we would apply undersampling by removing observations from the “0” class, which would help even out the class distribution of the dataset. The downside to this technique, however, is that valuable information could be removed from the data while attempting to achieve a more even class distribution. As a result, this method is best suited to very large datasets, where there is less risk of removing crucial information.
Implementing this process is quite simple; below is an example from my “Predicting Flight Delay” project on GitHub, where the majority class consists of flights that are not delayed, i.e. rows where ‘DELAYED’=0. Here, using scikit-learn’s resample method, I downsample this majority class in order to improve the class imbalance:
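As a sketch of what that resample call looks like (the toy `df` below is a stand-in for the real flight data, which is not reproduced here; only the `DELAYED` column name comes from the project), downsampling the majority class with `sklearn.utils.resample` might look like this:

```python
import pandas as pd
from sklearn.utils import resample

# Toy stand-in for the flight data: 8 on-time flights, 2 delayed
df = pd.DataFrame({
    "DELAYED":  [0] * 8 + [1] * 2,
    "DISTANCE": [100, 200, 150, 300, 250, 120, 400, 180, 220, 310],
})

majority = df[df["DELAYED"] == 0]
minority = df[df["DELAYED"] == 1]

# Downsample the majority class to the size of the minority class
majority_down = resample(
    majority,
    replace=False,             # sample without replacement
    n_samples=len(minority),   # match the minority-class count
    random_state=42,           # reproducible draw
)

balanced = pd.concat([majority_down, minority])
print(balanced["DELAYED"].value_counts())
```

Setting `replace=False` keeps each retained majority row unique; the resulting `balanced` frame has an even 50/50 class split.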
3. Change performance metric
When dealing with imbalanced data, it is important to pay attention to which performance metric you are aiming to optimize: accuracy, precision, recall, or F1 score. Accuracy is often the default metric, but precision, recall, or F1 score may be more appropriate depending on the context of the problem. A brief explanation of these metrics follows:
Precision: Precision measures how many of the items predicted as positive are actually positive. It is calculated as True Positives / (True Positives + False Positives). A high precision score means there are few false positives, so if false positives are harmful in your application, precision should be optimized.
Recall: Recall measures how many of the actual positives the model correctly identifies; actual positives that the model misses are false negatives. The formula for recall is True Positives / (True Positives + False Negatives). A high recall score indicates that there are few false negatives. If false negatives are harmful, as when identifying credit card fraud, then it is important to optimize recall.
F1 score: If minimizing both false negatives and false positives is important to your model, then the F1 score is likely the most appropriate metric. The F1 score is the harmonic mean of precision and recall, calculated as: 2 × True Positives / (2 × True Positives + False Positives + False Negatives). If you are looking to balance precision and recall in your analysis, the F1 score does exactly that.
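The three formulas above can be checked directly with a few lines of Python (the confusion-matrix counts here are made up for illustration):

```python
# Toy confusion-matrix counts from an imaginary classifier
tp, fp, fn = 40, 10, 20

precision = tp / (tp + fp)        # TP / (TP + FP)
recall = tp / (tp + fn)           # TP / (TP + FN)
f1 = 2 * tp / (2 * tp + fp + fn)  # harmonic mean of precision and recall

print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.8 0.67 0.73
```

Note how the F1 score (0.73) lands between precision (0.8) and recall (0.67), penalizing whichever of the two is worse.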
4. Principal Component Analysis (PCA)
PCA can be applied to mitigate various issues related to machine learning modeling. In particular, PCA can improve the handling of very large datasets by reducing their dimensionality, i.e. reducing the number of variables without discarding much of the information in the dataset. Correlated variables are combined into a smaller set of components, ordered by how much of the data’s variance each one explains. PCA is a rather complex concept built on linear algebra which I will explore in more detail in a future blog post, but it is worth considering alongside sampling techniques when working to improve class imbalances.
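As a quick illustrative sketch (the data below is randomly generated, not from any real project), scikit-learn's `PCA` can be asked to keep just enough components to retain 95% of the variance, shrinking the number of variables automatically:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 samples of 10 features that are really driven by 2 latent factors
base = rng.normal(size=(200, 2))
X = base @ rng.normal(size=(2, 10)) + 0.05 * rng.normal(size=(200, 10))

# A float n_components keeps enough components for 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)
```

Because the 10 features are built from only 2 underlying factors plus a little noise, PCA compresses them down to a handful of components while retaining almost all of the dataset's information.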
This blog post only considers some of the many methods that can be used when dealing with class imbalance. Overall, when choosing the ideal method, it is important to determine what you are aiming to achieve in your analysis, and which performance metric you would like to optimize. Thank you for reading!