Handling imbalanced data
When performing classification, a common problem arises when one class contains far more data than another, roughly defined as a 4:1 ratio or greater in the binary case, though imbalance can also occur in multi-class datasets. In these scenarios, the accuracy paradox applies: a model can achieve very high accuracy while its predictions are simply biased towards the more abundant class.
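The accuracy paradox is easy to demonstrate with a toy example (the 95:5 split below is illustrative, not from any real dataset):

```python
# Toy dataset: 95 legitimate (0) and 5 fraudulent (1) transactions.
y_true = [0] * 95 + [1] * 5

# A "model" that always predicts the majority class.
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.95 -- looks great, yet every fraud case is missed

# Recall on the rare class tells the real story.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_pos / sum(y_true)
print(recall)  # 0.0
```

Despite 95% accuracy, the classifier is useless for the task it was built for, which is why metrics other than accuracy matter here.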
These scenarios occur fairly often in certain fields, such as customer surveys, since the vast majority of people don’t fill out the forms, or financial fraud detection, since the majority of purchases are legitimate. In fact, there is a well-known Kaggle challenge on detecting credit card fraud.
Approaches for handling imbalanced data
As discussed both informally and more academically, there are multiple ways to deal with imbalanced data:
- Reframing the problem: This could involve changing your performance metrics: accuracy isn’t the only way to measure a model’s performance; there are also F-scores, Cohen’s κ, etc. You can also decompose the larger class into smaller classes, which can then be paired against the minority class using ensemble classifiers. Alternatively, you can treat the scenario as outlier detection and use one-class classifiers.
- Data-level methods: If you have a lot of data (tens of thousands of rows), you could try oversampling the rare class, or undersampling the abundant class. You could also generate new data for the rare class: like resampling, this attempts to shift the class balance in your favor by creating samples similar to the existing rare-class data. There are mature algorithms for this (see below).
- Algorithm-level methods: This involves imposing costs on mistaken predictions using penalized models, pushing your predictions away from the majority class.
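As a sketch of an algorithm-level method, many scikit-learn classifiers accept a class_weight argument that penalizes mistakes on the rare class more heavily. The dataset below is synthetic and purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic imbalanced data: 950 majority vs 50 minority samples,
# drawn from overlapping Gaussian clusters.
X = np.vstack([rng.normal(0.0, 1.0, (950, 2)), rng.normal(1.5, 1.0, (50, 2))])
y = np.array([0] * 950 + [1] * 50)

# class_weight="balanced" reweights errors inversely to class frequency,
# pushing the decision boundary away from the majority class.
plain = LogisticRegression().fit(X, y)
weighted = LogisticRegression(class_weight="balanced").fit(X, y)

# The penalized model flags far more of the rare class.
print(plain.predict(X).sum(), weighted.predict(X).sum())
```

The unweighted model predicts the minority class only where it is very confident, while the penalized model trades some overall accuracy for much better coverage of the rare class.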
Tools for handling imbalanced data
For oversampling, R has, among others, the smotefamily package; Python has the imbalanced-learn package. Both implement the oversampling algorithms below:
- SMOTE (Synthetic Minority Over-sampling Technique) was presented in a 2002 JAIR paper, which has a nice explanation here. SMOTE generates additional samples of the rare class by interpolating between each minority sample and its K-nearest neighbors.
- ADASYN (Adaptive Synthetic Sampling), published in 2008, builds upon SMOTE: more synthetic data is generated for minority-class samples that are harder to learn, making your predictions more robust.
On the other hand, ROSE (Random Over-Sampling Examples), published in 2012 and implemented in the ROSE R package, uses both over- and under-sampling.
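The core of SMOTE can be sketched in a few lines of NumPy: for each synthetic point, pick a random minority sample, pick one of its k nearest minority neighbors, and interpolate between the two. This is a simplified illustration on made-up data, not the library implementation:

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by k-NN interpolation."""
    rng = np.random.default_rng(seed)
    # Pairwise distances among the minority samples only.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    # Indices of each sample's k nearest minority neighbors.
    nn = np.argsort(d, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))   # random minority sample
        j = nn[i, rng.integers(k)]     # one of its k nearest neighbors
        gap = rng.random()             # interpolation factor in [0, 1]
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_min = np.random.default_rng(1).normal(0.0, 1.0, (20, 2))
X_new = smote_sketch(X_min, n_new=80)
print(X_new.shape)  # (80, 2)
```

In practice you would use the library instead; with imbalanced-learn the equivalent call is `SMOTE().fit_resample(X, y)`.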
Back to fraud-detection
One newer approach is to encode the data using autoencoders, which learn a compressed representation of the data, normally for dimensionality-reduction purposes. Since fraudulent and legitimate events follow different distributions, fraudulent events incur higher reconstruction errors on the encoding, so examining the reconstruction errors after autoencoding can reveal fraud. A nice walkthrough using R and Keras can be found on the RStudio blog.
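A minimal sketch of the idea, using scikit-learn's MLPRegressor trained to reproduce its own input through a narrow bottleneck as a stand-in for the Keras autoencoder in the blog post (the data below is synthetic and the cluster parameters are arbitrary):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: legitimate events cluster tightly;
# fraudulent events come from a different, dispersed distribution.
X_legit = rng.normal(0.0, 1.0, (1000, 10))
X_fraud = rng.normal(3.0, 2.0, (20, 10))

scaler = StandardScaler().fit(X_legit)
X_train = scaler.transform(X_legit)

# An autoencoder is a network trained to reconstruct its input through
# a bottleneck; fitting an MLP with X as both input and target mimics this.
ae = MLPRegressor(hidden_layer_sizes=(4,), max_iter=500, random_state=0)
ae.fit(X_train, X_train)

def reconstruction_error(X):
    X_s = scaler.transform(X)
    return np.mean((ae.predict(X_s) - X_s) ** 2, axis=1)

# Fraudulent events, drawn from a different distribution than the
# training data, reconstruct poorly and stand out.
print(reconstruction_error(X_legit).mean(), reconstruction_error(X_fraud).mean())
```

Thresholding the reconstruction error then turns the autoencoder into a one-class fraud detector, which ties this approach back to the outlier-detection framing above.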