The fallacy of predict_proba
Summary
The article addresses the common misconception that `model.predict_proba(X)` outputs true probabilities. It argues that these values are only model scores transformed into the range 0 to 1. They are monotone in the model's confidence, but monotonicity does not imply calibration, which is the property that predicted probabilities match observed long-run frequencies. The Spiegelhalter Z statistic is presented as a way to measure calibration: under the hypothesis of perfect calibration Z is approximately standard normal, so |Z| > 1.96 indicates miscalibration at the 5% significance level. The distinction matters for threshold decisions, cost-sensitive scoring, and accurate risk reporting. A benchmark study, *Classifier Calibration at Scale*, evaluated five calibrators and found that Platt scaling and isotonic regression can degrade performance on modern tabular models. Venn–Abers predictors showed the largest log-loss reductions and provide coverage guarantees, while Beta calibration was the most consistently helpful single-number calibrator. Conformal classification is offered as an alternative that yields prediction sets with guaranteed coverage.
Key takeaway
If you deploy classification models and your system treats `predict_proba` values as true probabilities for thresholding, cost functions, or risk reporting, measure calibration explicitly on held-out data. Assuming calibration without checking it can cost accuracy in deployment, because thresholds and expected-cost calculations built on miscalibrated scores will be off. Consider Venn–Abers predictors when you need probabilities with coverage guarantees, or Beta calibration when you need a single, more reliable probability. Common methods such as Platt scaling may degrade performance on modern tabular models.
Key insights
`predict_proba` outputs model scores, not calibrated probabilities. This affects every downstream decision that consumes those values, so calibration has to be measured explicitly.
Principles
- Calibration must be measured, not assumed.
- `predict_proba` outputs model scores, not true probabilities.
- Platt scaling and isotonic regression can degrade performance on modern tabular models.
Method
Measure calibration on held-out data with the Spiegelhalter Z statistic. Then fit a Venn–Abers predictor or Beta calibration on a held-out calibration set to improve proper scoring rules such as log-loss.
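As a minimal sketch of the measurement step, the Spiegelhalter Z statistic can be computed directly from held-out binary labels and predicted probabilities. The function names and the toy data below are illustrative assumptions, not taken from the original article. The formula is the standard one, under which Z is approximately standard normal when the probabilities are calibrated.

```python
import math


def spiegelhalter_z(y_true, p_pred):
    """Spiegelhalter Z for binary labels (0/1) and predicted P(y = 1)."""
    numerator = 0.0
    variance = 0.0
    for y, p in zip(y_true, p_pred):
        numerator += (y - p) * (1.0 - 2.0 * p)
        variance += (1.0 - 2.0 * p) ** 2 * p * (1.0 - p)
    if variance == 0.0:
        # Happens when every prediction is exactly 0, 0.5, or 1.
        raise ValueError("Z is undefined for these predictions")
    return numerator / math.sqrt(variance)


def two_sided_p_value(z):
    """Two-sided p-value of z under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Toy held-out set (assumed data): the model says 0.9 for every case,
# but only 60 of the 100 cases are actually positive.
y_holdout = [1] * 60 + [0] * 40
p_holdout = [0.9] * 100

z = spiegelhalter_z(y_holdout, p_holdout)
print(f"Z = {z:.2f}, p-value = {two_sided_p_value(z):.3g}")
print("miscalibrated at the 5% level" if abs(z) > 1.96 else "no evidence of miscalibration")
```

In this toy case |Z| is far above 1.96, so the scores would be flagged as miscalibrated. With a real model, `p_holdout` would be the positive-class column of `predict_proba` on data the model was not trained on.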
In practice
- Test `predict_proba` output for calibration on held-out data before relying on it.
- Use Venn–Abers predictors when probability estimates need coverage guarantees.
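One way to apply the Venn–Abers recommendation is the inductive variant: for each test score, isotonic regression is fitted twice on a held-out calibration set, once with the test point labeled 0 and once labeled 1, giving a probability interval `[p0, p1]`. The sketch below is plain Python for readability and is not optimized. The helper names and toy scores are assumptions for illustration, and `p1 / (1 - p0 + p1)` is one commonly used way to merge the interval into a single probability, not necessarily the one used in the benchmark.

```python
def _isotonic_value_at(scores, labels, s):
    """Fit increasing isotonic regression (pool-adjacent-violators) to
    (scores, labels) and return the fitted value at score s.
    s must be one of the scores."""
    # Pool tied scores first so equal scores share one fitted value.
    totals = {}
    for score, y in zip(scores, labels):
        entry = totals.setdefault(score, [0.0, 0])
        entry[0] += y
        entry[1] += 1

    blocks = []  # each block: [label_sum, count, highest_score_in_block]
    for score in sorted(totals):
        blocks.append([totals[score][0], totals[score][1], score])
        # Merge while the previous block's mean exceeds the current mean.
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            top = blocks.pop()
            blocks[-1][0] += top[0]
            blocks[-1][1] += top[1]
            blocks[-1][2] = top[2]

    for label_sum, count, highest in blocks:
        if s <= highest:
            return label_sum / count
    raise ValueError("s must be one of the fitted scores")


def venn_abers_interval(cal_scores, cal_labels, s):
    """Return (p0, p1, merged) for one test score s."""
    scores = list(cal_scores) + [s]
    p0 = _isotonic_value_at(scores, list(cal_labels) + [0], s)
    p1 = _isotonic_value_at(scores, list(cal_labels) + [1], s)
    merged = p1 / (1.0 - p0 + p1)
    return p0, p1, merged


# Toy calibration set (assumed data): raw model scores and true labels.
cal_scores = [0.1, 0.2, 0.8, 0.9]
cal_labels = [0, 0, 1, 1]

p0, p1, merged = venn_abers_interval(cal_scores, cal_labels, 0.95)
print(f"interval = [{p0:.3f}, {p1:.3f}], merged probability = {merged:.3f}")
```

The width of `[p0, p1]` shows how much the calibration data constrain the estimate. With only four calibration points the interval is wide, and it narrows as the calibration set grows.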
Topics
- Model Calibration
- `predict_proba` Fallacy
- Conformal Prediction
- Venn–Abers Predictors
- Beta Calibration
- Spiegelhalter Z Statistic
Best for: AI Engineer, Research Scientist, Machine Learning Engineer, Data Scientist, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Valeriy’s Substack.