Tomek Links, SMOTE, and XGBoost for Fraud Detection

Eric Johnson
Jun 25, 2021

Table of Contents:

  1. Introduction
  2. A First Look at the Data
  3. Imbalanced Learning Problems
  4. Tomek Links
  5. SMOTE
  6. Why XGBoost?
  7. Variables Used and Hyperparameter Tuning
  8. Results and Conclusion

Introduction

In this article I will describe a method to resample the Kaggle Credit Card Fraud dataset to feed into XGBoost for credit card fraud detection. The resampling is a combination of under- and over-sampling. The under-sampling is accomplished by Tomek Links and the over-sampling by SMOTE. Implementing the resampling is easy with the imblearn package, but understanding what it is we are doing, and in what order, is critical to explaining why this is a valid processing step. I will explain this in the section labelled Imbalanced Learning Problems.

The resampling itself improves out-of-the-box XGBoost performance from an 84% fraud detection rate (recall on the fraud data points) with a 99%+ correct classification rate (overall accuracy) to an 88% fraud detection rate with a 99%+ correct classification rate. After tuning XGBoost hyperparameters, I end up with a model that has a 92% fraud detection rate and a 98% correct classification rate. These results vary up or down by a few percentage points depending on the chosen random seed; the reported recall and overall accuracy values sit near the middle of their respective distributions.

As a new data scientist with a PhD in mathematics, I was excited to work with this dataset because I believed it presented an opportunity to focus on a few cutting-edge classification algorithms. For instance, XGBoost is a fascinating machine learning algorithm that combines mathematical, statistical, and computer hardware insights to build a wide array of classification and regression trees. What turned out to be truly interesting is that the real challenge with this dataset was overcoming the class imbalance before any modeling even happened.

A First Look at the Data

A description of the dataset can be found here. When we load the dataset, we see that there are 31 variables, 30 of which are samples from continuous variables. That is, there are no categorical variables other than the class label, which is binary: 0 for a legitimate transaction and 1 for a fraudulent transaction. The 28 variables “V1” through “V28” were created by a principal component analysis, so these variables are uncorrelated with (linearly independent of) each other. This also means that most of the variables are not interpretable in terms of domain-specific knowledge. The only variables that come with meaningful names are “Time”, “Amount” and “Class”.

The “Time” variable is defined as the time elapsed since the first transaction in the dataset. It is very unlikely to help us, since there is no account identification information, which means we cannot calculate the time between consecutive transactions for a given credit account. It could be argued that if we knew the date on which the first transaction happened, we might be able to track a seasonal component to credit fraud. Maybe there is more fraud in the summer (pure speculation)? But neither the date of the first transaction nor a creation date for the dataset is available. So we drop the “Time” column when we split the dataset into potential model features, X, and the class labels, y.
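As a minimal sketch of this loading and splitting step (the file name creditcard.csv is my assumption for the CSV downloaded from Kaggle, not something stated in the article):

    import pandas as pd

    # Assumes the Kaggle file has been downloaded locally as "creditcard.csv".
    df = pd.read_csv("creditcard.csv")

    # Drop the "Time" column and separate the features from the binary class label.
    X = df.drop(columns=["Time", "Class"])
    y = df["Class"]  # 0 = legitimate, 1 = fraudulent

    print(X.shape)
    print(y.value_counts())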

Below is a 3D plot of the three most significant principal components from a PCA of the dataset.
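A sketch of how such a plot can be generated with scikit-learn and Matplotlib follows, assuming the X and y defined above; standardizing the features before the PCA and the color choices are my assumptions, not details given in the article.

    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # Project the features onto their three leading principal components.
    X_scaled = StandardScaler().fit_transform(X)
    components = PCA(n_components=3).fit_transform(X_scaled)

    fraud = (y == 1).to_numpy()
    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    ax.scatter(*components[~fraud].T, s=2, c="blue", label="legitimate")
    ax.scatter(*components[fraud].T, s=2, c="red", label="fraud")
    ax.legend()
    plt.show()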

This plot already hints at a severe class imbalance, which we will verify just below. PCA generally preserves features like clustering. I say generally because counterexamples where PCA destroys cluster information are easy to cook up. You can click here if you would like to read a Towards Data Science article detailing an alternative to PCA called PPA. Moreover, if we believe that this PCA preserved most of the important information, the amount of overlap between the fraudulent and legitimate data points looks like it could be a challenge to overcome. Luckily, the tight packing of fraudulent and legitimate transactions displayed in this plot will not end up having much impact on final predictive performance.

By creating a simple histogram for each variable, one sees that the “Amount” variable covers a very large range of values and that the data is concentrated in two main clusters: a low-transaction-amount cluster and a high-transaction-amount cluster. To keep this wide range of scales from getting in the way of a better model, we transform the “Amount” variable to log space and create the “Log_Amount” variable. I am not going to show code for this, as it turns out that “Log_Amount” and “Amount” both have a low correlation with the class labels and I did not use them in the final model.

The Kaggle Credit Card Fraud dataset is severely imbalanced. We can see this by counting the values in the “Class” column and making the following chart (found just below the code) using Matplotlib.
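A minimal sketch of that count and chart, assuming the df loaded earlier:

    import matplotlib.pyplot as plt

    # Count the two class labels; the dataset has 284,315 legitimate
    # transactions and only 492 fraudulent ones.
    counts = df["Class"].value_counts().sort_index()
    print(counts)

    plt.bar(["legitimate (0)", "fraudulent (1)"], counts.values)
    plt.ylabel("number of transactions")
    plt.title("Class counts in the credit card fraud dataset")
    plt.show()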

In general, before we have chosen a specific algorithm, there are several standard options for handling this kind of imbalance. Some of these options work at the algorithm level, where a penalty is imposed for misclassifying the minority class or an initial bias is set. Alternatively, we can under-sample the majority class, over-sample the minority class, or both. These ideas can be applied to multi-class problems as well, but we will not consider that here.⁴ For this project I used a combination of under- and over-sampling with the imblearn package (https://imbalanced-learn.org/stable/combine.html).

Imbalanced Learning Problems

Imbalanced learning problems are an important subclass of machine learning problems. Class imbalance arises more often in industry than one might think. For instance: fraud detection, some areas of medical diagnostics, pollution detection, and some applications of remote sensing.¹ Whether the class imbalance is caused by a lack of collected examples from the minority class or by a natural imbalance in the underlying probability distribution of the data, most classification algorithms like decision trees and neural networks assume a balanced distribution of class labels.¹ It is sometimes even the case that there is limited information about the cost of misidentifying the class label. Thankfully, when detecting credit card fraud we know that there is a large cost associated with not identifying fraudulent transactions soon after they occur. We can also assume there is a smaller and more manageable cost to misidentifying a legitimate transaction as fraudulent. If a transaction is classified as fraudulent, a credit company can flag the account and contact the owner. If it turns out it was a legitimate transaction, then they can unfreeze the accounts and the account owner can usually proceed as if nothing happened.

If a credit company is missing too large a percentage of fraud, a viable strategy for committing fraud is simply to purchase as much as possible with minimal thought about how to get away with it undetected. Criminals who would commit fraud should have to think about how they are going to go undetected. Forcing someone to strategize about how not to get caught might act as a deterrent and prevent some instances of fraud outright. With this in mind, I am going to heavily prioritize fraud detection at the expense of how precisely my model can do this. My viewpoint is that as long as the percentage of correctly classified transactions (the accuracy) remains high enough, we can maximize the percentage of fraudulent transactions detected (the recall for the fraud class) while potentially sacrificing the percentage of predicted fraud that turns out to be actual fraud (the precision for the fraud class).

The approach I take here is to heavily over-sample the fraudulent transaction class using SMOTE and to lightly under-sample the legitimate transaction class using Tomek Links. The most important aspect of this is the over-sampling. In the pipeline used here, the under-sampling is applied first and the over-sampling second, since cleaning the majority class first makes the generation of synthetic samples more efficient. Let’s clarify what we mean by over-sampling. SMOTE does not simply duplicate instances of the fraud class, though that is something that can be done. For instance, the TensorFlow imbalanced classification tutorial uses the Kaggle Credit Card Fraud dataset, and to handle the class imbalance it simply “amplifies the signal” of the minority class while training the sequential neural network built in that tutorial; the duplicated minority class samples are then redistributed across multiple training batches. But this approach is not preferred in general because, with simple duplication, we are not actually learning the geometric/spatial distribution of the full collection of credit card frauds in our chosen feature space.

Tomek Links

For the under-sampling I used Tomek Links. The goal of the under-sampling is to eliminate some of the legitimate transactions that are near the edges of, or are surrounded by, the set of fraudulent transactions. This makes the boundary between the two classes clearer. Under-sampling first also makes the generation of new fraudulent transaction samples more efficient and produces more homogeneity in the geometric/spatial distributions of the two classes. The Tomek Links method works by:

  • Isolating a data point x from the minority class.
  • Finding the nearest neighbors to x from the majority class.
  • Eliminating some of the nearest neighbors to x that are from the majority class, to help define a clearer boundary between the majority and minority classes.

The elimination in the third bullet is accomplished by defining a Tomek link. A Tomek link exists between a data point x from the majority class and a data point y from the minority class if there is no data point z, from either class, that is closer to x or to y than x and y are to each other.
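Here is a minimal sketch of this under-sampling step with imblearn's TomekLinks class. It assumes X_train and y_train are the training portion of a train/test split performed beforehand (the split itself is sketched near the end of the SMOTE section); restricting removal to the majority class is done with the sampling_strategy="majority" option.

    from imblearn.under_sampling import TomekLinks

    # Remove legitimate (majority-class) points that sit in a Tomek link,
    # i.e. legitimate points whose nearest neighbor is a fraud point and vice versa.
    tomek = TomekLinks(sampling_strategy="majority")
    X_tomek, y_tomek = tomek.fit_resample(X_train, y_train)

    print(f"removed {len(y_train) - len(y_tomek)} legitimate points")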

SMOTE

For over-sampling I used a technique called SMOTE (Synthetic Minority Over-sampling Technique).² The classical version of SMOTE does the following to create synthetic minority class samples (a short code sketch follows the list).

  • First, SMOTE chooses a minority class data point.
  • Then, SMOTE uses part of the k-nearest neighbors (kNN) method to find other minority class data points nearby.
  • Then, SMOTE creates synthetic minority samples that lie on the lines connecting the chosen minority class point to some or all of its nearest neighbors of the same class. (For the math terminology crowd out there, this means that SMOTE only generates samples inside the convex hull of the original minority class data points.)
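A minimal sketch of this over-sampling step with imblearn's SMOTE class, continuing from the Tomek Links sketch above; fully balancing the classes (sampling_strategy=1.0) and the random seed are my choices for illustration, not necessarily the exact settings used in the article.

    import pandas as pd
    from imblearn.over_sampling import SMOTE

    # Create synthetic fraud points along segments joining each fraud point
    # to one of its k nearest fraud-class neighbors (k_neighbors=5 is the default).
    smote = SMOTE(sampling_strategy=1.0, k_neighbors=5, random_state=42)
    X_resampled, y_resampled = smote.fit_resample(X_tomek, y_tomek)

    print(pd.Series(y_resampled).value_counts())  # the two classes are now balanced 1:1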

There now exist several modern variations of the classical SMOTE algorithm. These variations try to address a few weaknesses inherent in the classical method. One weakness is that classical SMOTE assumes the best representation for the minority class is a continuous (and convex) geometric shape, which may not always be the case. In [3], the authors point out another weakness: because classical SMOTE over-samples locally using kNN, it can end up connecting two distinct clusters in the minority class. Lastly, it can also be the case that SMOTE does not respect a non-convex or non-linear geometry present in the distribution of the data.

The weaknesses just mentioned may seem like a bummer.

“What? You aren’t presenting a fool-proof method for detecting credit card fraud? WHY AM I EVEN READING THIS?!?!?!”

Don’t worry! It’s still okay to use classical SMOTE here, because there is a relatively low cost to classifying a legitimate transaction as fraudulent compared to the high cost of not detecting a fraudulent transaction. SMOTE will create new synthetic minority class members inside the convex hull of the fraudulent transactions in the training set. This will make it much easier for a classifier (like XGBoost) to create a decision boundary that contains most of the fraudulent transactions on one side of that boundary. It is true that the more “solid”, polygon-like nature of the new minority sample set makes it more likely that we misclassify legitimate transactions as fraudulent. But as long as we don’t misclassify too many transactions overall, this should not deter us from using SMOTE.

Before we move on, the following must be stated. It is very important not to over- or under-sample your whole dataset, especially when generating new data points as we are here. Specifically, we cannot apply any of the over- or under-sampling to the test dataset. We want to perform the split between training and testing before we do any over- or under-sampling. The data science term for this type of error is data leakage, and care must be taken to avoid it. The test set must be representative of both the features of the data and the class imbalance, or we will not be able to confidently generalize the model to detect credit card fraud on other datasets.
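Putting the order of operations together, a minimal sketch looks like this; the 70:30 split matches the ratio used later in the article, while stratify=y and the random seeds are my additions.

    from sklearn.model_selection import train_test_split
    from imblearn.under_sampling import TomekLinks
    from imblearn.over_sampling import SMOTE

    # Split FIRST; only the training portion is ever resampled.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=42
    )

    # Under-sample the legitimate class, then over-sample the fraud class.
    X_under, y_under = TomekLinks(sampling_strategy="majority").fit_resample(X_train, y_train)
    X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X_under, y_under)

    # X_test and y_test keep the original class imbalance and are never resampled.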

Why XGBoost?

XGBoost is a shallow learning method that uses gradient boosting to build a sequence of classification/regression trees which, added together as an ensemble, create an intelligent, flexible and robust predictive algorithm for many use cases.⁵ To clarify, by shallow we mean that XGBoost does not apply many successive transformations to the input data to generate features for approximating a decision function directly, the way a deep neural network does.³ As a shallow learner, an XGBoost model is typically much easier to interpret and explain.

XGBoost is optimized to work with very large datasets that cannot be placed into main memory all at once. Given the severe class imbalance, it is good to use an algorithm like XGBoost that we know can scale up to a much larger sample size in the future. The class imbalance will still exist, but if we were to have a much larger sample overall, then in absolute terms there will be more examples of credit fraud to train a machine to look for. Even if we could obtain more fraud examples in absolute terms, it would probably still be wise to over- and under-sample to correct for the class imbalance. But, given more examples of fraud to work with, we may get even better results from resampling.

A distinct advantage of using XGBoost in a business use-case that involves customer data is that XGBoost uses trees. Classification/regression trees are often preferred when not all variables are continuous, or when there is a desire to keep the model more interpretable. It is easy to imagine that a credit card company looking to predict fraud would want to include demographic and other categorical/discrete variables that cannot simply be replaced by a continuous variable. To clarify, it is not that neural networks can’t handle categorical variables at all; the categoricals just need to be encoded numerically, and an arbitrary numeric encoding imposes an ordering and a notion of distance that may not be meaningful. My point is that decision trees use categorical variables in a straightforward and interpretable way. This also makes for more intuitive visualizations in presentations, since the final outcome of a decision tree can be read as a flow-chart.³

Variables Used and Hyperparameter Tuning

I first ran a correlation analysis to double-check that no feature variables were correlated with each other and to see which variables were most highly correlated with the class labels. As it turned out, I also used this correlation analysis to reduce the dimensionality of the training data supplied to the resampling methods. Without reducing the number of columns fed to the over- and under-sampling, the run time on my machine would have been impractical for production; work would be needed to distribute the computations over many processors. As a sanity check, notice that the 28 x 28 subgrid consisting of the correlations for variables V1 through V28 shows essentially zero correlation for each distinct pair of variables. Phew!
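A minimal sketch of this selection step, assuming the X_train and y_train from the split sketched earlier; the printed list is whatever the correlation ranking produces and is not guaranteed to match the article's final 15 variables.

    # Rank the PCA features by absolute correlation with the class label
    # and keep the 15 most strongly correlated ones.
    v_cols = [c for c in X_train.columns if c.startswith("V")]
    correlations = X_train[v_cols].corrwith(y_train).abs().sort_values(ascending=False)
    top15 = correlations.head(15).index.tolist()
    print(top15)

    X_train_15, X_test_15 = X_train[top15], X_test[top15]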

We can also check what happens to a 3-component PCA when we keep only the 15 most highly correlated variables. Note that this plot was made without the combination of under- and over-sampling.

The gap in the body of red dots gives us some evidence that keeping only the 15 variables most correlated with the class labels will make it easier to define a decision boundary. Now let’s see a PCA plot for the under- and over-sampled training set.

This plot is a little misleading: after resampling, the fraudulent and the legitimate transactions are present in roughly the same proportion. We can see, though, that the fraudulent transactions now form a much more solid region, giving us a well-defined signal with which to train a model. There is still a gap in the fraud data, but the proportion of legitimate transactions mixed in with the newly generated fraudulent transactions appears much smaller. So, we finally turn to XGBoost.

Running XGBoost out-of-the-box without tuning the hyperparameters leads to performance that might be acceptable if we were not already aware that significantly better performance could be achieved. Tuning hyperparameters is always dabbling in the dark arts. I like to think of the hyperparameters as the assumptions we must teach our learning algorithm to adopt after we watch it train and predict. I say assumptions because hyperparameters are usually not learnable in any way by the algorithm itself. We must learn these for the algorithm. There are two main approaches for learning the best hyperparameters to use for a machine learning algorithm:

  • Run a broad grid search that narrows down to a much finer grid search.
  • Use domain knowledge and algorithm understanding to narrow the search first.

Ultimately, a search of some kind must be performed; how much intelligence we provide in teaching our learning algorithm which hyperparameters to use is up to us. For some hyperparameters, like the initial bias, it is easy to see how their value will affect performance, and this allows us to make well-educated guesses about what they should be. For others, it can be extremely difficult or impossible to predict ahead of time how their value will affect performance. In that case it is often best to perform a grid search of some kind, if that is not too computationally expensive. A high-quality explanation of hyperparameter tuning and the use of a grid search method can be found here.
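As an illustration only, a narrow grid search over an XGBoost model might look like the sketch below; the parameter ranges, the recall scoring choice, and the names X_resampled_15 and y_resampled_15 (the resampled, 15-variable training set built from the earlier sketches) are my assumptions, since the article does not list the exact grid that was searched.

    from sklearn.model_selection import GridSearchCV
    from xgboost import XGBClassifier

    # Hypothetical grid; the article does not publish the values it actually tried.
    param_grid = {
        "max_depth": [3, 5, 7],
        "learning_rate": [0.05, 0.1, 0.3],
        "n_estimators": [100, 300],
    }

    search = GridSearchCV(
        XGBClassifier(eval_metric="logloss"),
        param_grid,
        scoring="recall",  # prioritize the fraud detection rate
        cv=3,
        n_jobs=-1,
    )
    search.fit(X_resampled_15, y_resampled_15)
    print(search.best_params_, search.best_score_)

One design note: cross-validating on an already-resampled training set lets synthetic points leak between folds; an alternative is to put the samplers inside an imblearn Pipeline so the resampling is refit on each training fold.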

Results and Conclusion

First I ran XGBoost with no resampling of the training data and without tuning hyperparameters. I used a 70:30 train:test split, and obtained the following results:
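A minimal sketch of that baseline run, assuming the 70:30 split from the earlier sketch:

    from xgboost import XGBClassifier
    from sklearn.metrics import classification_report, confusion_matrix

    # Baseline: default hyperparameters, no resampling, full feature set.
    baseline = XGBClassifier()
    baseline.fit(X_train, y_train)

    y_pred = baseline.predict(X_test)
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred, digits=4))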

These results are really good for being right out-of-the-box. However, it is always deceptive to be impressed by the overall accuracy for an imbalanced problem like this. For this particular dataset, an overall accuracy of 99.83% can be obtained by simply classifying every transaction as legitimate.
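To see where that number comes from: the full dataset contains 492 frauds among 284,807 transactions, so always predicting "legitimate" gives

    total, frauds = 284_807, 492
    print((total - frauds) / total)  # ≈ 0.99827, i.e. about 99.83% accuracy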

It is clear from looking at the confusion matrix that out-of-the-box XGBoost is doing more than merely guessing at random or classifying every test data point as one particular class. However, I will show that we can do better. Before we do, let me talk briefly about the train:test split. In this problem it is interesting, and maybe a little counterintuitive, that if we use an 80:20 train:test split instead of a 70:30 split, all of the recall results are slightly lower. One would of course hope that increasing the size of the training set would only make results better. One explanation is that when we increase the size of the training set, we remove from the test set some fraudulent transactions that an XGBoost model is good at picking up. I have not experimentally verified this yet, so this is just some Boost-food for thought.

The recall score on the fraud class is the rate of fraud detection for this dataset. After recall, the second most important metric is the overall accuracy; the accuracy determines how often customers end up interacting with the fraud detection system. Less important, and the metric we will sacrifice, is the precision score for the fraud class. We don’t talk about the precision for the legitimate transaction class because it remains very high in all iterations of the model.

Now, using only the 15 most highly correlated variables from “V1” through “V28”, I ran XGBoost out-of-the-box and obtained the following recall, accuracy and precision scores, which are displayed below the classification report and unnormalized confusion matrix.

This is actually really good for doing nothing but keeping the 15 variables most correlated with the class labels: we only see a 2% reduction in fraud detection. We must be disciplined and not be distracted by the other percentages sitting in the 90s if we want to push the fraud detection rate itself above 90%. We now add in the under- and over-sampling of these 15 variables.

Notice that the fraud detection rate increased by 4% from the first model by under- and over-sampling with the 15 variables most correlated with class labels. This is pretty darn good for not even tuning hyperparameters! You might be tempted to worry about the large drop in precision of the fraud class. As far as performance goes, this is fine. Also, observe that the model is still only predicting 237 instances of fraud out of over 86,000 test set transactions. The metric to monitor here is accuracy. If the overall accuracy were to dip below 90%, then we should strongly consider reverting back to the model given above.

Now we see what happened when I tuned the hyperparameters for XGBoost. I obtained the following results:

The two important take-aways from these results:

  • This model detects credit card fraud at a rate of 92%.
  • This model accurately classifies nearly 99% of all transactions.

The credit fraud detection rate has now increased to about 92%, roughly another 4 percentage points over the under- and over-sampled model before hyperparameter tuning. The very large drop in precision displayed in the output above seems scary at first glance, but again, look at the overall accuracy: on average, only about 1% of all transactions will be flagged as fraud. We can interpret this as saying that, in the long run, most customers would rarely, if ever, interact with the fraud detection system when no fraud was committed. There are of course questions of whether this particular dataset is representative and whether these results would scale well to predicting credit fraud outside the EU. With the data available to me I cannot give a meaningful answer to such questions. What I can say is that this particular model does well at catching credit fraud on this dataset without impacting too many customer experiences. In a business use-case, that can count for a lot. But the real test would be to see how well under- and over-sampling combined with tuned XGBoost would do on a continuously updating stream of test cases.

[1]: Abd Elrahman, S. M., & Abraham, A. A review of class imbalance problem. Journal of Network and Innovative Computing, 1(2013), 332–340.

[2]: Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16 (2002), 321–357. arXiv:1106.1813 [cs.AI].

[3]: Chollet, F. Deep Learning with Python. Manning Publications, (2017).

[4]: https://imbalanced-learn.org/stable/over_sampling.html#multi-class-management

[5]: Chen, T., Guestrin, C. XGBoost: A Scalable Tree Boosting System. (2016). https://arxiv.org/abs/1603.02754v3
