The fallacy of predict_proba

· Source: Valeriy’s Substack · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, medium

Summary

The article challenges the common assumption that `model.predict_proba(X)` returns true probabilities. What it returns is a model score mapped into the interval [0, 1]. These values are monotone in the model's confidence, but that does not make them calibrated. Calibration is the property that predicted probabilities match observed long-run frequencies: among cases scored 0.8, about 80% should be positive. The article presents the Spiegelhalter Z statistic as a way to test calibration; Z is approximately standard normal when the model is calibrated, so |Z| > 1.96 signals miscalibration at the 5% level. The distinction matters for threshold decisions, cost-sensitive scoring, and risk reporting. A benchmark study, *Classifier Calibration at Scale*, evaluated five calibrators and found that Platt scaling and isotonic regression can degrade performance on modern tabular models. Venn–Abers predictors showed the largest log-loss reductions and provide coverage guarantees, while Beta calibration was the most consistently helpful single-number calibrator. Conformal classification is offered as an alternative that yields prediction sets with guaranteed coverage.
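
To make the idea of a prediction set with guaranteed coverage concrete, here is a minimal sketch of split conformal classification. It uses the common nonconformity score 1 − p(true class), which may differ from the variant the article discusses. The function names `conformal_threshold` and `prediction_set` are hypothetical, and the guarantee (marginal coverage of at least 1 − alpha) assumes the calibration and test examples are exchangeable.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Compute the score threshold from a held-out calibration set.

    cal_probs:  one list of per-class scores (e.g. a predict_proba row) per example
    cal_labels: the true class index of each example
    """
    # Nonconformity score: one minus the score given to the true class.
    scores = sorted(1.0 - probs[y] for probs, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank of the (1 - alpha) quantile.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: fall back to including every class.
        return math.inf
    return scores[k - 1]

def prediction_set(probs, qhat):
    """Return every class index whose nonconformity score is within the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= qhat]
```

A confident, well-separated input tends to get a single-class set, while an ambiguous one gets several classes. The set size therefore carries the uncertainty that a raw `predict_proba` value cannot be trusted to express.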

Key takeaway

For Machine Learning Engineers deploying classification models: if your system treats `predict_proba` values as true probabilities for thresholding, cost functions, or risk reporting, you must explicitly measure calibration on held-out data. Assuming calibration without checking it can carry a significant accuracy cost in deployment. Consider Venn–Abers predictors for robust probability coverage, or Beta calibration when you need a single, more reliable probability. Common methods such as Platt scaling may degrade performance.

Key insights

`predict_proba` returns model scores, not calibrated probabilities. Any downstream decision that treats those scores as probabilities depends on calibration that has to be measured explicitly.

Method

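Measure calibration on held-out data with the Spiegelhalter Z statistic, treating |Z| > 1.96 as evidence of miscalibration. If the model is miscalibrated, fit Venn–Abers or Beta calibration on a held-out set to improve proper scores such as log-loss.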
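
The measurement step can be sketched in a few lines. This uses the standard textbook form of the Spiegelhalter statistic for binary outcomes, Z = Σ (y − p)(1 − 2p) / sqrt(Σ (1 − 2p)² p (1 − p)). The function name `spiegelhalter_z` is hypothetical, and the code is written without dependencies for illustration; it is not taken from the article.

```python
import math

def spiegelhalter_z(y_true, p_pred):
    """Spiegelhalter Z statistic for binary labels (0/1) and predicted P(y = 1).

    Under the null hypothesis that the predictions are calibrated, Z is
    approximately standard normal, so |Z| > 1.96 rejects at the 5% level.
    """
    numerator = 0.0
    variance = 0.0
    for y, p in zip(y_true, p_pred):
        w = 1.0 - 2.0 * p
        numerator += (y - p) * w            # observed minus expected contribution
        variance += w * w * p * (1.0 - p)   # its variance under calibration
    if variance == 0.0:
        # Happens when every prediction is exactly 0, 0.5, or 1.
        raise ValueError("statistic is undefined for these predictions")
    return numerator / math.sqrt(variance)
```

With a scikit-learn style binary classifier, a call would look like `spiegelhalter_z(y_val, model.predict_proba(X_val)[:, 1])`, where `y_val` and `X_val` are a held-out split (the names are assumptions). Predictions of 0.8 on a group that is 80% positive give Z near 0, while predictions of 0.99 on a group that is only 50% positive give a Z far above 1.96.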

Best for: AI Engineer, Research Scientist, Machine Learning Engineer, Data Scientist, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Valeriy’s Substack.