Platt Scaling Destroyed My Model

2026-03-03 · Source: Valeriy’s Substack · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, short

Summary

A recent study (arXiv:2601.19944) challenges the common machine learning advice to "always calibrate your probabilities" with Platt scaling. The study evaluated Platt scaling across 21 classifiers, 30 binary datasets, and 150 cross-validation folds, and found that applying it to models that are already well calibrated significantly degrades performance. For CatBoost, TabICL, EBM, and TabPFN, which were among the best-calibrated models, log-loss increased by 5.3%, 6.0%, 4.4%, and 5.0% respectively, and predictions got worse on 87-93% of evaluation folds. Platt scaling distorts well-calibrated outputs by compressing the tails and reducing sharpness. Because a fitted Platt map is normally monotonic in the model's score, the damage is to the probability values themselves rather than to the ranking of predictions. Platt scaling remains beneficial for severely miscalibrated models such as XGBoost, Random Forest, and MLPs, where it reduced log-loss by 8.8%, 14.8%, and 24.3% respectively.
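To make the "compressing tails" effect concrete, here is a minimal sketch of why pulling already-calibrated probabilities toward 0.5 raises log-loss. The `shrink` function and its slope of 0.8 are illustrative assumptions, not values from the study; the example only shows the direction of the effect, not the magnitudes reported above.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

def expected_log_loss(p_true, q_pred):
    """Expected log-loss when the event occurs with probability p_true
    and the model reports q_pred."""
    return -(p_true * math.log(q_pred) + (1.0 - p_true) * math.log(1.0 - q_pred))

def shrink(p, slope=0.8):
    """Hypothetical distortion: a sigmoid map on the log-odds with slope < 1,
    which pulls probabilities toward 0.5 (compresses the tails)."""
    return sigmoid(slope * logit(p))

# Assume these reported probabilities are exactly right (perfect calibration).
calibrated = [0.02, 0.10, 0.50, 0.90, 0.98]

before = sum(expected_log_loss(p, p) for p in calibrated) / len(calibrated)
after = sum(expected_log_loss(p, shrink(p)) for p in calibrated) / len(calibrated)
increase_pct = 100.0 * (after - before) / before

print(f"log-loss before: {before:.4f}, after: {after:.4f} (+{increase_pct:.1f}%)")
```

Any map that moves a correct probability away from its true value increases expected log-loss, so the only question for an already-calibrated model is how large the increase is.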

Key takeaway

If you are an AI engineer or research scientist evaluating model performance, measure your model's calibration with the Spiegelhalter |Z|-statistic before applying post-hoc methods such as Platt scaling. If the model is already well calibrated (|Z| < 1.96), Platt scaling will likely increase log-loss and reduce prediction sharpness, making the model worse. Measure first instead of applying standard advice by default.

Key insights

Blindly applying Platt scaling to well-calibrated models degrades performance by reducing sharpness.

Principles

Platt scaling is a rescue tool for miscalibrated models.
Modern models like CatBoost are often pre-calibrated.
Calibration has a cost; measure first.
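The rescue case can be sketched as follows: a toy overconfident model whose log-odds are about twice as extreme as the observed frequencies justify, corrected by fitting q = sigmoid(a * score + b). This is a simplified illustration under stated assumptions: the data are synthetic, `fit_platt` is a hypothetical helper using plain gradient descent, and the label smoothing from Platt's original procedure is omitted.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def log_loss(labels, probs):
    """Mean binary log-loss."""
    return -sum(y * math.log(q) + (1 - y) * math.log(1.0 - q)
                for y, q in zip(labels, probs)) / len(labels)

def fit_platt(scores, labels, n_iter=500):
    """Fit q = sigmoid(a * score + b) by gradient descent on log-loss.
    Scores are the model's log-odds, taken from a held-out set."""
    n = len(scores)
    a, b = 1.0, 0.0  # start from the identity map on log-odds
    # Step size 1/L, where L bounds the curvature of the mean log-loss.
    step = 4.0 / (sum(s * s for s in scores) / n + 1.0)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y
            grad_a += err * s
            grad_b += err
        a -= step * grad_a / n
        b -= step * grad_b / n
    return a, b

# Synthetic held-out set: (log-odds score, positives out of 100).
# A score of 4.0 claims ~98% but only 88% are positive, and so on.
scores, labels = [], []
for s, positives in [(4.0, 88), (2.0, 73), (-2.0, 27), (-4.0, 12)]:
    scores += [s] * 100
    labels += [1] * positives + [0] * (100 - positives)

a, b = fit_platt(scores, labels)
raw = [sigmoid(s) for s in scores]
scaled = [sigmoid(a * s + b) for s in scores]

print(f"a={a:.3f}, b={b:.3f}")
print(f"log-loss raw: {log_loss(labels, raw):.4f}, scaled: {log_loss(labels, scaled):.4f}")
```

On this toy data the fitted slope lands near 0.5, which halves the overconfident log-odds and lowers log-loss. In practice the calibrator should be evaluated on data it was not fit on.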

Method

Compute the Spiegelhalter |Z|-statistic on a held-out set to assess calibration. If |Z| < 1.96 (the two-sided 5% critical value of a standard normal, so miscalibration is not detected), do not apply Platt scaling. If |Z| > 3.0, Platt scaling will help, but Venn-Abers is superior.
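For reference, the usual form of the Spiegelhalter statistic for binary labels y_i and predicted probabilities p_i is Z = sum((y_i - p_i)(1 - 2 p_i)) / sqrt(sum((1 - 2 p_i)^2 p_i (1 - p_i))), which is approximately standard normal when the model is calibrated. A minimal sketch, assuming plain Python lists as inputs; the function name `spiegelhalter_z` and the example data are illustrative, not taken from the source article.

```python
import math

def spiegelhalter_z(y_true, p_pred):
    """Spiegelhalter Z-statistic for binary labels (0/1) and predicted
    probabilities. Large |Z| is evidence of miscalibration."""
    numerator = 0.0
    variance = 0.0
    for y, p in zip(y_true, p_pred):
        numerator += (y - p) * (1.0 - 2.0 * p)
        variance += (1.0 - 2.0 * p) ** 2 * p * (1.0 - p)
    if variance == 0.0:
        # Happens when every prediction is exactly 0, 0.5, or 1.
        raise ValueError("statistic undefined: zero variance under the null")
    return numerator / math.sqrt(variance)

# Predictions of 0.8 with 8 of 10 positives: consistent with calibration.
z_ok = spiegelhalter_z([1] * 8 + [0] * 2, [0.8] * 10)

# Predictions of 0.99 with only 90 of 100 positives: overconfident.
z_bad = spiegelhalter_z([1] * 90 + [0] * 10, [0.99] * 100)

print(f"|Z| calibrated example: {abs(z_ok):.2f}")
print(f"|Z| overconfident example: {abs(z_bad):.2f}")
```

The statistic should be computed on held-out predictions, since in-sample probabilities tend to look better calibrated than they are.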

In practice

Calculate |Z|-statistic before applying calibration.
Avoid Platt scaling for CatBoost and similar models.
Consider Venn-Abers for highly miscalibrated models.
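The checklist above can be written as a small decision helper. The thresholds 1.96 and 3.0 come from this summary; the function name `recommend_calibration` and the returned strings are hypothetical. The summary gives no rule for values between the two thresholds, so the sketch reports that range as unspecified instead of guessing.

```python
def recommend_calibration(z):
    """Map a Spiegelhalter Z value to the advice given in this summary."""
    abs_z = abs(z)
    if abs_z < 1.96:
        # Already well calibrated: Platt scaling is likely to hurt.
        return "skip Platt scaling"
    if abs_z > 3.0:
        # Severely miscalibrated: calibration helps.
        return "calibrate: Platt scaling helps, consider Venn-Abers"
    # Between the thresholds the summary gives no explicit guidance.
    return "unspecified: compare calibrated and raw log-loss on held-out data"

for z in (0.4, -2.5, 9.0):
    print(z, "->", recommend_calibration(z))
```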

Topics

Platt Scaling
Model Calibration
Log-Loss
Spiegelhalter Z-statistic
CatBoost

Best for: AI Engineer, NLP Engineer, Research Scientist, Machine Learning Engineer, Data Scientist, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Valeriy’s Substack.