86 Imbalanced Overview

86.1 Imbalanced Overview

Whether to keep this chapter in this book or not was considered, as the methods in this section are borderline feature engineering methods.

The potential problem with imbalanced data can most easily be seen in a classification setting. Propose we want to predict if an incoming email is spam or not. Furthermore, we assume that the spam rate is low and around 1% of the incoming emails. If you are not careful, you can easily end up with a model that predicts “not spam” all the time, since it will be correct 100% of the time. This is a common scenario, and it happens all the time. One culprit could be that there isn’t enough information in the minority class to be able to distinguish it from the majority class. And that is okay, not all modeling problems are easy. But your model will do its best anyway.

There are several different ways to handle imbalanced data, we will list all the ways in broad strokes, and then cover the methods that we could count as feature engineering.

using the right performance metrics
weights
loss functions
Calibrating the predictions
Sampling the data, can be done naively and with more advanced methods

The example above showcases why accuracy as a metric isn’t a good choice when the classes are not uniformly represented. So you can look at other metrics, such as precision, recall, ROC-AUC, or Brier score. You will need to know what metric works best for your project.

Another way we can handle this is by adding weights, either to the observations directly, or in the modeling framework as class weights. Giving your minority class high enough weight forces your model to consider them.

Related to the last point, some methods allow you the user to pass in custom objective functions, this can also be beneficial.

Even if your model performs badly by default, your classification model might still have good separation, just not around the 50% cut-off point. Changing the threshold is another way you can overcome an imbalanced data set.

Lastly, and the ways that will be covered in this book is a sampling of the data. There are several different methods that we will cover in this book. These methods cluster somehow, so for some groups we only explain the general idea.

We can split these methods into two groups, under-sampling methods and over-sampling methods. In the under-sampling method, we are removing observations and in the over-sampling we are adding observations. Adding observations is usually done by basing the new observations on the existing observations, exactly or by interpolation.

Over-sampling methods we will cover are:

Up-sampling in Chapter 87
ROSE in Chapter 88
SMOTE in Chapter 89
SMITE variants in Chapter 90
Borderline Smote in Chapter 91
Adaptive Synthetic Algorithm in Chapter 92

Under-sampling methods we will cover are:

Down-sampling in Chapter 93
NearMiss in Chapter 94
Tomek Links in Chapter 95
Condensed Nearest Neighbor in Chapter 96
Edited Nearest Neighbor in Chapter 97
Instance Hardness Threshold in Chapter 98
One-Sided Selection Chapter 99

Some methods do these methods together. We won’t consider those methods by themselves and instead let you know that you can do over-sampling followed by under-sampling if you choose.