Stop Misusing ROC Curve and GINI: Navigate Imbalanced Datasets with Confidence

Discover how the Precision-Recall curve can provide a more robust metric for binary classification in data science and machine learning.

Angel Igareta
Klarna Engineering

--

Imagine stepping into the complex world of binary classification problems. As a Senior Data Scientist at Klarna, this is my day-to-day reality. Binary classification is a cornerstone of data science, with applications touching everything from credit default predictions to medical diagnoses and spam detection. Yet, these problems come with their own unique set of challenges.

Metrics such as the GINI coefficient and ROC_AUC often serve as our compass in this maze. They are widely trusted and used for evaluating models. But here’s the catch: they might not always point us in the right direction. Can we rely on them blindly, or do we need to dig deeper?

The path gets even more challenging when we encounter imbalanced datasets. In such cases, the effectiveness of our trusted metrics can be seriously compromised.

In this post, I invite you to join me on a journey to explore these metrics in greater depth. We will question their effectiveness, understand their limitations, and reveal alternatives that could prove to be more reliable navigational tools in the world of binary classification problems.

Understanding Model Predictions and Metrics

To truly grasp the nuances of model evaluation, let’s start by setting the stage with a real-world scenario that we often encounter at Klarna.

Imagine we’re tasked with predicting customer loan defaults. We have two categories to consider — paid or default. However, in our scenario, the default rate is a mere 2%. This is a classic case of data imbalance, and it’s exactly the kind of challenge we’re up against.

To evaluate our model’s performance in this scenario, we need to understand its predictions. We break these down into the four distinct outcomes that make up the confusion matrix (a short code sketch follows the list):

  1. True Positives (TP): These are the customers who our model correctly identifies as defaulters.
  2. False Positives (FP): These are the customers who our model incorrectly flags as defaulters. Such errors can result in losses of customer lifetime value.
  3. True Negatives (TN): These are the customers who our model correctly identifies as non-defaulters.
  4. False Negatives (FN): These are the customers who our model incorrectly flags as non-defaulters. Such errors can lead to direct financial losses, whose size depends on the average loss per default and the recovery rate.
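
To make these outcomes concrete, here is a minimal sketch of how the four counts can be extracted with scikit-learn. The label and prediction arrays are hypothetical placeholders, not Klarna data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = default, 0 = paid (purely illustrative data)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0])  # hard predictions at some threshold

# With labels=[0, 1], confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"TP={tp}, FP={fp}, TN={tn}, FN={fn}")
```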

With these categories in place, we can delve into the metrics we use to measure our model’s performance. In real-world applications, Data Scientists often aim to provide the best model within certain constraints, such as a range of risk profiles (like default rates). They don’t fixate on a single threshold. The Underwriting teams then pick a threshold based on the company’s current risk appetite and objectives over a given period.

Hence, we’ll skip popular measures like Accuracy or F1-Score and instead focus on threshold-agnostic metrics, which gauge the model’s ability to distinguish between categories regardless of the chosen threshold.

ROC Curve

Let’s delve deeper into one of the most widely used threshold-agnostic metrics: ROC_AUC, which stands for Receiver Operating Characteristic Area Under the Curve. It summarizes a graphical representation, the ROC curve, that provides a comprehensive view of our model’s predictive capabilities.

The ROC curve plots the True Positive Rate (TPR) on the y-axis and the False Positive Rate (FPR) on the x-axis. This gives us a clear view of the trade-off between TPR and FPR at different classification thresholds, offering a holistic understanding of our model’s performance across varying cutoff points.

[Figure] ROC Curve Illustration: Comparing Classifier Performances from Best to Worst. The curve plots TPR (y-axis) against FPR (x-axis) for several classifiers, including a random-classifier baseline; better models bow toward the top-left corner.

But what exactly are TPR and FPR? Let’s break it down:

  1. True Positive Rate (TPR): This is the proportion of actual defaulters that our model correctly identifies. Mathematically, TPR is calculated as TP / (TP + FN). Another way to think about it is, “If a customer defaults, what’s the chance our model will catch it?”
  2. False Positive Rate (FPR): This is the proportion of actual non-defaulters that our model mistakenly identifies as defaulters. Mathematically, FPR is calculated as FP / (FP + TN). You can interpret it as, “If a customer didn’t default, what’s the likelihood our model incorrectly marks them as a defaulter?”

The area under the ROC curve gives us the ROC_AUC score. A higher score indicates that our model catches a large share of actual defaulters (high TPR) while raising few false alarms (low FPR) across all classification thresholds.
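
For readers who want to reproduce this, here is a minimal sketch using scikit-learn’s roc_curve and roc_auc_score. The scores and labels are hypothetical placeholders; in practice the scores would come from a trained model’s predict_proba.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical predicted default probabilities and true labels (illustrative only)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.08, 0.30, 0.15, 0.60, 0.12, 0.85, 0.40])

# roc_curve sweeps every threshold and returns the (FPR, TPR) pairs that trace the curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# roc_auc_score integrates the curve into a single threshold-agnostic number
print(f"ROC_AUC = {roc_auc_score(y_true, y_score):.3f}")
```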

GINI Coefficient

Derived from ROC_AUC, the GINI coefficient provides a simple yet insightful measure of a model’s performance. It ranges from 0, which signifies a model with no discriminative power, to 1, indicating perfect discrimination between classes. The formula for calculation is: GINI = 2 * ROC_AUC - 1.
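
Because GINI is just a rescaling of ROC_AUC, it takes only a couple of lines to compute. The small helper below is an illustrative assumption rather than a standard library function, and it reuses the hypothetical arrays from the ROC sketch.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def gini(y_true, y_score):
    # Rescale ROC_AUC so that 0 means no discriminative power and 1 means perfect separation
    return 2 * roc_auc_score(y_true, y_score) - 1

# Hypothetical scores, reusing the illustrative arrays from the ROC sketch above
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.08, 0.30, 0.15, 0.60, 0.12, 0.85, 0.40])
print(f"GINI = {gini(y_true, y_score):.3f}")
```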

For stakeholders, the GINI coefficient offers a quick, single-figure snapshot of model effectiveness. Unlike the ROC_AUC, which requires a deeper understanding of true and false positives and negatives, the GINI coefficient provides a more immediate sense of model performance, making it a popular choice for non-technical audiences.

However, both the ROC_AUC and the GINI coefficient share a common pitfall.

While they excel in evaluating models for balanced datasets, they can be misleading when dealing with imbalanced datasets. This is because they don’t take into account the ratio of the positive and negative classes.

Consequently, these metrics can yield a high score even when the model is performing poorly on the minority class. When negatives vastly outnumber positives, even a large number of false alarms barely moves the FPR, so the score stays high while telling us little about the minority class, which is often the one we care about most in scenarios like fraud detection or rare disease diagnosis.

Yet, there’s no need for concern. We have a remedy for this situation: Enter PR_AUC. This metric provides a more reliable evaluation for imbalanced datasets, which we’ll explore next.

Precision Recall Curve

Let’s delve into PR_AUC (Precision-Recall Area Under the Curve). Like the ROC curve, it’s a graphical representation of model performance. But it uses Precision and Recall.

Let’s break down these two components:

  1. Precision: The ratio of true positive outcomes (correctly identified defaulters in our case) to all positive outcomes predicted by the model. It’s calculated as TP / (TP + FP). You can view it as, “When our model flags a customer as a defaulter, what’s the probability they really are?”
  2. Recall (or True Positive Rate): This is the same quantity we discussed in the context of the ROC curve: the proportion of actual defaulters correctly identified by the model. It’s the answer to, “Out of all the customers who defaulted, what portion did our model manage to identify?”

The PR curve, with Precision on the y-axis and Recall on the x-axis, illustrates the trade-off between these two metrics for different threshold values, akin to the ROC curve’s TPR and FPR trade-off view.
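
As with the ROC curve, scikit-learn can trace the PR curve and summarize it. The sketch below uses the same hypothetical scores as before and shows two common summaries, the trapezoidal area under the curve and average precision, which approximate the area slightly differently.

```python
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

# Hypothetical predicted default probabilities and true labels (illustrative only)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.08, 0.30, 0.15, 0.60, 0.12, 0.85, 0.40])

# precision_recall_curve sweeps thresholds and returns the (precision, recall) pairs
precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# Trapezoidal area under the PR curve, and the average-precision summary
pr_auc = auc(recall, precision)
avg_precision = average_precision_score(y_true, y_score)
print(f"PR_AUC = {pr_auc:.3f}, average precision = {avg_precision:.3f}")
```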

[Figure] PR Curve Illustration with a target incidence of 0.1: Comparing Classifier Performances from Best to Worst. The curve plots Precision (y-axis) against Recall (x-axis) for several classifiers, including a random-classifier baseline; better models bow toward the top-right corner.

The PR_AUC score, derived from the area under the PR curve, indicates the model’s accuracy when flagging defaulters (high precision) and its ability to detect a large portion of actual defaulters (high recall). However, the score’s dependence on the target incidence (the proportion of positive cases in the dataset) may make it less intuitive for those who are not well-versed in model evaluation.

The PR_AUC score offers a detailed insight into the model’s performance, especially on imbalanced datasets, proving to be an invaluable metric for data scientists and analysts.

Difference Between ROC_AUC and PR_AUC

Both ROC_AUC and PR_AUC are powerful tools for summarizing a model’s performance, but they differ in their underlying components and in how they portray a model’s effectiveness.

Here’s a brief refresher on the metrics used in both curves:

ROC_AUC

  • True Positive Rate (TPR): TP / (TP + FN)
  • False Positive Rate (FPR): FP / (FP + TN)

PR_AUC

  • Precision: TP / (TP + FP)
  • Recall: TP / (TP + FN)

The impact of imbalanced datasets becomes critical when the negative class significantly outnumbers the positive class, leading to an abundance of TNs. This imbalance can distort metrics like ROC_AUC, which incorporates the FPR in its calculation.

A high number of TNs can result in a misleadingly low FPR, even with many False Positives, thereby inflating the ROC_AUC score and painting an overly optimistic picture of the model’s performance.

PR_AUC, unlike ROC_AUC, doesn’t consider TN. It focuses on the model’s precision and recall, providing a more realistic evaluation of performance on imbalanced datasets by concentrating on the minority class.
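
To see this divergence concretely, here is a small illustrative experiment on synthetic data (not Klarna data) with roughly 2% positives. The exact numbers vary by run, but ROC_AUC typically looks comfortable while PR_AUC tells a much more sobering story.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic dataset with ~2% positives, loosely mirroring the 2% default-rate example
X, y = make_classification(
    n_samples=50_000, n_features=20, weights=[0.98, 0.02], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_score = model.predict_proba(X_test)[:, 1]

# On imbalanced data the two summaries can tell very different stories
print(f"ROC_AUC: {roc_auc_score(y_test, y_score):.3f}")
print(f"PR_AUC (average precision): {average_precision_score(y_test, y_score):.3f}")
```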

As we’ve unraveled these two powerful metrics, you’re now equipped with the knowledge to discern which one is best suited for your unique dataset and business problem.

Next, we’ll apply these concepts to a practical problem, bringing these metrics to life with real-world data.

Practical Illustration: The Tale of Two Models

To bring these concepts to life, let’s consider two models trained on the same imbalanced dataset. We have two contenders in the ring:

  • Model A: A Logistic Regression model, simple yet effective.
  • Model B: A Gradient Boosting model, known for its precision and handling of complex datasets.

Both models were trained on a dataset with a 2% default rate, a classic case of imbalance, and tested on a set of 10,000 customers.

As explained before, performance metrics are usually calculated by integrating over all possible thresholds, not just a single point. However, for simplicity in this illustration, we will consider the errors at a fixed threshold and use simplified ROC_AUC and PR_AUC calculations.

Following the evaluation of the models, the performance of each can be summarized as follows:

Model A’s Performance Card

| Confusion Matrix      | Predicted Non-Defaulters | Predicted Defaulters |
|-----------------------|:------------------------:|:--------------------:|
| Actual Non-Defaulters | 9600 (TN) | 200 (FP) |
| Actual Defaulters | 100 (FN) | 100 (TP) |

Model B’s Performance Card

| Confusion Matrix      | Predicted Non-Defaulters | Predicted Defaulters |
|-----------------------|:------------------------:|:--------------------:|
| Actual Non-Defaulters | 9700 (TN) | 100 (FP) |
| Actual Defaulters | 100 (FN) | 100 (TP) |

We now proceed to calculate the simplified ROC and PR areas under the curve for both models, approximating each curve by a single operating point; a short code sketch reproducing these numbers follows the two metric lists.

Model A’s Performance Metrics

  • ROC_AUC: Approximately 49%, calculated as TPR * (1 - FPR) = 0.5 * (1 - 0.0204) = 0.4898.
  • PR_AUC: Approximately 17%, calculated as Precision * Recall = 0.3333 * 0.5 ≈ 0.1667.

Model B’s Performance Metrics

  • ROC_AUC: Approximately 49%, calculated as TPR * (1 - FPR) = 0.5 * (1 - 0.0102) = 0.4949.
  • PR_AUC: Approximately 25%, calculated as Precision * Recall = 0.5 * 0.5 = 0.25.
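
These single-point numbers can be reproduced with a few lines of arithmetic. The sketch below plugs the counts from the two performance cards into the simplified formulas used in this illustration (not the usual integration over all thresholds).

```python
def simplified_scores(tp, fp, tn, fn):
    """Single-operating-point approximations used purely for this illustration."""
    tpr = tp / (tp + fn)              # recall
    fpr = fp / (fp + tn)
    precision = tp / (tp + fp)
    roc_auc_approx = tpr * (1 - fpr)  # simplified, single-point ROC_AUC
    pr_auc_approx = precision * tpr   # simplified, single-point PR_AUC
    return roc_auc_approx, pr_auc_approx

# Counts taken from the performance cards above
for name, counts in [("Model A", dict(tp=100, fp=200, tn=9600, fn=100)),
                     ("Model B", dict(tp=100, fp=100, tn=9700, fn=100))]:
    roc_approx, pr_approx = simplified_scores(**counts)
    print(f"{name}: ROC_AUC ~ {roc_approx:.4f}, PR_AUC ~ {pr_approx:.4f}")
```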

Comparison

When comparing both models, if we were to just look at the ROC_AUC scores, which are approximately 49% for both Model A and Model B, we might erroneously conclude that both models perform identically.

However, the ROC_AUC metric can distort our understanding on imbalanced datasets like ours, because its FPR term is dominated by the large number of True Negatives. Neglecting this subtlety could result in rejecting good customers and missing out on their potential lifetime value.

Despite Model A incorrectly flagging 100 more customers as defaulters than Model B, the nearly identical ROC_AUC scores would suggest equivalent performance.

This is where the PR_AUC metric comes into play. With scores of 17% for Model A and 25% for Model B, this metric, with its emphasis on False Positives, unveils a different scenario — Model B outperforms Model A in correctly classifying the minority class.

Conclusions

In this post, we explored the complexities of model evaluation for binary classification problems with imbalanced datasets. We discussed the limitations of popular metrics like ROC_AUC and GINI, which can be misleading in such scenarios, and introduced PR_AUC as a more reliable alternative.

Through a practical example, we demonstrated how metric choice impacts the interpretation of model performance. We highlighted the importance of selecting a metric that aligns with your dataset and problem to ensure accurate model evaluation.

In conclusion, the key takeaway is the importance of understanding your data and the problem at hand. Choosing the right metric that aligns with your dataset and problem can lead to more accurate measurements of your model’s performance and provide meaningful insights. Remember, there’s no one-size-fits-all solution in data science. It’s all about finding the right tool for the job.

Angel Igareta, Senior Data Scientist
Medium | LinkedIn | GitHub

Did you enjoy this post and want to stay updated on our latest projects and advancements in the engineering field? Join the Klarna Engineering community on Medium, Meetup.com and LinkedIn.
