Platt Scaling Destroyed My Model
Summary
A recent study (arXiv:2601.19944) challenges the common machine learning advice to "always calibrate your probabilities" with Platt scaling. The study evaluated Platt scaling across 21 classifiers, 30 binary datasets, and 150 cross-validation folds, and found that applying it to models that are already well calibrated significantly degrades performance. CatBoost, TabICL, EBM, and TabPFN, which were among the best-calibrated models, saw log-loss increase by 5.3%, 6.0%, 4.4%, and 5.0% respectively, with predictions getting worse on 87-93% of evaluation folds. Platt scaling distorts well-calibrated outputs by compressing the tails and reducing sharpness, so the probability estimates themselves get worse. It remains beneficial for severely miscalibrated models such as XGBoost, Random Forest, and MLPs, where it reduced log-loss by 8.8%, 14.8%, and 24.3% respectively.
Key takeaway
AI engineers and research scientists evaluating model performance should measure calibration with the Spiegelhalter |Z|-statistic before applying post-hoc methods such as Platt scaling. If the model is already well calibrated (|Z| < 1.96), Platt scaling will likely increase log-loss and reduce prediction sharpness, making the model worse. Measure first instead of applying standard advice across the board.
Key insights
Blindly applying Platt scaling to well-calibrated models degrades performance by reducing sharpness.
Principles
- Platt scaling is a rescue tool for miscalibrated models.
- Modern models like CatBoost are often pre-calibrated.
- Calibration has a cost; measure first.
Method
Assess model calibration using the Spiegelhalter |Z|-statistic on a held-out set. If |Z| < 1.96, do not apply Platt scaling. If |Z| > 3.0, Platt scaling will help, but Venn-Abers is superior.
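The check described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the study's code: the function names `spiegelhalter_z` and `calibration_advice` are hypothetical, the statistic uses the usual textbook form of Spiegelhalter's Z, and the 1.96 and 3.0 cutoffs are taken from the rule stated above. The summary gives no guidance for values between the two cutoffs, so the sketch reports that range as undecided.

```python
import math


def spiegelhalter_z(y_true, p_pred):
    """Spiegelhalter's Z for binary labels (0/1) and predicted probabilities.

    Z = sum((y - p) * (1 - 2p)) / sqrt(sum((1 - 2p)^2 * p * (1 - p)))
    Under perfect calibration Z is approximately standard normal.
    """
    if len(y_true) != len(p_pred):
        raise ValueError("y_true and p_pred must have the same length")
    num = sum((y - p) * (1.0 - 2.0 * p) for y, p in zip(y_true, p_pred))
    var = sum((1.0 - 2.0 * p) ** 2 * p * (1.0 - p) for p in p_pred)
    if var == 0.0:
        # Happens only when every prediction is exactly 0, 0.5, or 1.
        raise ValueError("statistic undefined for these predictions")
    return num / math.sqrt(var)


def calibration_advice(z):
    """Map |Z| to the rule of thumb quoted in the text (hypothetical helper)."""
    abs_z = abs(z)
    if abs_z < 1.96:
        return "skip Platt scaling: model looks well calibrated"
    if abs_z > 3.0:
        return "calibrate: Platt scaling should help, consider Venn-Abers"
    return "undecided: the summary gives no rule for 1.96 <= |Z| <= 3.0"


# Two held-out predictions, both on the correct side of 0.5.
z = spiegelhalter_z([1, 0], [0.8, 0.2])
print(round(z, 3), calibration_advice(z))  # Z is about -0.707
```

In real use, `y_true` and `p_pred` would come from a held-out set that was not used to train the model.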
In practice
- Calculate the |Z|-statistic before applying calibration.
- Avoid Platt scaling for CatBoost and similar models.
- Consider Venn-Abers for highly miscalibrated models.
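To make "measure first" concrete, the sketch below fits a Platt-style sigmoid and compares log-loss before and after. It rests on several assumptions that are not from the original article: the sigmoid is fit on the logit of the model's predicted probabilities (Platt scaling is often fit on raw decision scores instead), Platt's label smoothing is omitted, and plain gradient descent stands in for a proper optimizer. All function names are hypothetical.

```python
import math

EPS = 1e-6  # clip probabilities so logs and logits stay finite


def _clip(p):
    return min(max(p, EPS), 1.0 - EPS)


def _logit(p):
    p = _clip(p)
    return math.log(p / (1.0 - p))


def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_loss(y_true, p_pred):
    """Mean negative log-likelihood for binary labels."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = _clip(p)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def fit_platt(y_cal, p_cal, n_iter=2000):
    """Fit q = sigmoid(a * logit(p) + b) by gradient descent on log-loss.

    Starts from the identity map (a=1, b=0). The step size is 1/L, where
    L = 0.25 * (mean(s^2) + 1) bounds the curvature of the mean loss.
    """
    s = [_logit(p) for p in p_cal]
    n = len(s)
    step = 4.0 / (sum(v * v for v in s) / n + 1.0)
    a, b = 1.0, 0.0
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for y, v in zip(y_cal, s):
            r = _sigmoid(a * v + b) - y  # residual of the current fit
            grad_a += r * v
            grad_b += r
        a -= step * grad_a / n
        b -= step * grad_b / n
    return a, b


def apply_platt(p_pred, a, b):
    return [_sigmoid(a * _logit(p) + b) for p in p_pred]


# Toy overconfident model: predicts 0.99 / 0.01 but is right only 2/3 of the time.
p_demo = [0.99] * 6 + [0.01] * 6
y_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1]
a_demo, b_demo = fit_platt(y_demo, p_demo)
print("slope", round(a_demo, 3), "intercept", round(b_demo, 3))
print("log-loss before", log_loss(y_demo, p_demo))
print("log-loss after ", log_loss(y_demo, apply_platt(p_demo, a_demo, b_demo)))
```

A fitted slope below 1 pulls every probability toward 0.5, which is the tail compression the article describes: helpful for an overconfident model like the toy one here, harmful for a model whose extreme probabilities were already accurate. For an honest comparison, fit `a` and `b` on a calibration split and compare log-loss on a separate test split, keeping the calibrated outputs only if test log-loss actually drops.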
Topics
- Platt Scaling
- Model Calibration
- Log-Loss
- Spiegelhalter Z-statistic
- CatBoost
Best for: AI Engineer, NLP Engineer, Research Scientist, Machine Learning Engineer, Data Scientist, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Valeriy’s Substack.