Still Using predict_proba? Here’s Why Your Probabilities Are Lying to You
Summary
Scikit-learn's `predict_proba` method, widely used in machine learning, does not necessarily return calibrated probabilities. Its outputs lie between 0 and 1 and sum to one across classes, but they are scores: a predicted 0.8 does not have to mean the event occurs 80% of the time. Treating them as probabilities can lead to significant errors in critical applications such as risk pricing, patient triage, and fraud detection. An experiment using Logistic Regression, Random Forest, and Gradient Boosting on a 10,000-sample dataset revealed substantial miscalibration, with a maximum deviation of 0.223 for Random Forest and 0.177 for Gradient Boosting. Even Logistic Regression, often considered well calibrated, exhibits systematic overconfidence, with calibration error scaling as Θ(d/n). This occurs because the models behind `predict_proba` are trained and selected for discrimination, not calibration, so a high AUC or accuracy does not guarantee calibrated probabilities.
Key takeaway
If you are an AI engineer or data scientist building classification systems, verify the calibration of your `predict_proba` outputs before using them for critical decisions. Even strong performers like CatBoost may produce uncalibrated scores, leading to underpriced policies, false medical certainty, or undetected fraud. Plot reliability diagrams and compute Brier scores to assess calibration, and consider Venn-Abers predictors for distribution-free validity guarantees or Beta calibration for a computationally cheap correction.
Key insights
Scikit-learn's `predict_proba` outputs are scores, not calibrated probabilities, and treating them as probabilities leads to errors in critical decisions.
Principles
- Discrimination and calibration are distinct model properties.
- High AUC or accuracy does not imply calibration.
- Standard training does not guarantee probability calibration.
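The gap between discrimination and calibration can be illustrated with a small simulation. This is a minimal sketch on synthetic data, not the article's experiment: the true probabilities, the sharpening transform, and all variable names are assumptions made for illustration. A strictly increasing transform preserves the ranking of the scores, so AUC is unchanged, while the Brier score gets worse because the transformed scores no longer match the observed frequencies.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical ground truth: draw true probabilities, then sample labels from them.
p_true = rng.uniform(0.0, 1.0, size=20_000)
y = (rng.uniform(size=p_true.size) < p_true).astype(int)

# An overconfident score: a strictly increasing transform that pushes
# values toward 0 and 1 without changing their order.
p_sharp = p_true**3 / (p_true**3 + (1.0 - p_true) ** 3)

auc_true = roc_auc_score(y, p_true)
auc_sharp = roc_auc_score(y, p_sharp)
brier_true = brier_score_loss(y, p_true)
brier_sharp = brier_score_loss(y, p_sharp)

print(f"AUC   calibrated={auc_true:.4f}  overconfident={auc_sharp:.4f}")
print(f"Brier calibrated={brier_true:.4f}  overconfident={brier_sharp:.4f}")
```

Both score vectors rank the samples identically, so a metric based only on ranking cannot tell them apart. Only a calibration-sensitive metric such as the Brier score separates them.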
Method
Post-hoc calibration methods such as Venn-Abers predictors and Beta calibration can improve the quality of predicted probabilities. Venn-Abers offers distribution-free validity guarantees, while Beta calibration adds almost no inference latency.
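As a rough illustration of why Beta calibration is cheap at inference time, the sketch below fits a three-parameter map on held-out data using a logistic regression over the features ln(p) and -ln(1 - p). It is a simplified sketch, not a reference implementation: it omits the non-negativity constraint on the two slope parameters, and the dataset, the split sizes, the clipping constant, and the helper name `beta_features` are all assumptions introduced here. Venn-Abers is not covered by this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split


def beta_features(p, eps=1e-6):
    """Hypothetical helper: map scores p to the features [ln p, -ln(1 - p)]."""
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0) for scores of exactly 0 or 1
    return np.column_stack([np.log(p), -np.log(1.0 - p)])


# Synthetic data (an assumption), split into fit / calibration / test parts.
X, y = make_classification(
    n_samples=10_000, n_features=20, n_informative=10, random_state=0
)
X_fit, X_rest, y_fit, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_cal, X_test, y_cal, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)

# Base model whose raw scores will be calibrated.
base = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_fit, y_fit)

# Calibrator: logit(p_cal) = c + a * ln(p) - b * ln(1 - p).
# A large C makes the fit close to unregularized.
calibrator = LogisticRegression(C=1e6, max_iter=1000)
calibrator.fit(beta_features(base.predict_proba(X_cal)[:, 1]), y_cal)

p_raw = base.predict_proba(X_test)[:, 1]
p_cal = calibrator.predict_proba(beta_features(p_raw))[:, 1]

print(f"Brier raw:        {brier_score_loss(y_test, p_raw):.4f}")
print(f"Brier calibrated: {brier_score_loss(y_test, p_cal):.4f}")
```

At inference time the calibrator costs two logarithms and one sigmoid per prediction. The calibration set must be kept separate from the data used to fit the base model, otherwise the map is fitted to scores the model has already memorized.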
In practice
- Plot reliability diagrams to check model calibration.
- Compute the Brier score to quantify probability error.
- Consider Venn-Abers or Beta calibration for production.
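The first two checks can be sketched with scikit-learn's `calibration_curve` and `brier_score_loss`. The synthetic dataset, the hyperparameters, and the `max_bin_gap` summary below are assumptions chosen for illustration, so the numbers will not match those reported in the article. The `calibration_curve` output is the data behind a reliability diagram: plotting `mean_pred` against `frac_pos` next to the diagonal gives the usual picture.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (an assumption, not the article's dataset).
X, y = make_classification(
    n_samples=10_000, n_features=20, n_informative=10, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

report = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    p = model.predict_proba(X_test)[:, 1]
    # Per-bin observed positive rate and mean predicted score.
    frac_pos, mean_pred = calibration_curve(y_test, p, n_bins=10)
    report[name] = {
        "brier": brier_score_loss(y_test, p),
        # Largest gap between observed frequency and mean prediction in any bin.
        "max_bin_gap": float(np.max(np.abs(frac_pos - mean_pred))),
    }

for name, stats in report.items():
    print(f"{name}: brier={stats['brier']:.3f}, max_bin_gap={stats['max_bin_gap']:.3f}")
```

A well-calibrated model keeps every bin close to the diagonal, so a large `max_bin_gap` alongside a good AUC is a sign that the scores should not be read as probabilities.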
Topics
- Probability Calibration
- Scikit-learn predict_proba
- Reliability Diagrams
- Post-hoc Calibration
- Venn-Abers Predictors
Best for: AI Engineer, NLP Engineer, Computer Vision Engineer, Machine Learning Engineer, Data Scientist, AI Researcher
Editorial summary, takeaway, and curation by AIssential. Original article published by Valeriy’s Substack.