From prediction sets to calibrated class scores
Summary
Conformal classification is best known for producing prediction sets, but the same procedure also yields a calibrated p-value for every class, an aspect that tabular machine learning practitioners often overlook. Unlike scikit-learn's `predict_proba` outputs, which carry no formal calibration guarantee, conformal p-values are calibrated in finite samples and without distributional assumptions, provided the data are exchangeable. Concretely, for the true label `y`, the p-value `p_y(x)` falls at or below any threshold `t` with probability at most `t`. Both outputs come from the same calibration procedure: the prediction set `C_α(x) = {y : p_y(x) > α}` is simply a level-α cut of the conformal p-value vector. That vector of calibrated scores is valuable as input to downstream tabular models in applications such as credit, fraud, or churn, where cost-sensitive decisions depend on calibrated inputs.
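The guarantee and the set construction described in this summary can be written out explicitly. The block below is a sketch that assumes the standard split-conformal p-value, with a nonconformity score `s(x, y)` and `n` calibration examples; the summary does not spell out the formula, so treat the first line as an assumption rather than the article's exact definition.

```latex
% Assumed notation: s(x, y) is a nonconformity score,
% (x_i, y_i), i = 1..n, is the held-out calibration set.
p_y(x) = \frac{1 + \left|\{\, i : s(x_i, y_i) \ge s(x, y) \,\}\right|}{n + 1}

% Calibration (validity) of the true-label p-value under exchangeability:
\Pr\bigl(p_Y(X) \le t\bigr) \le t \quad \text{for all } t \in [0, 1]

% The prediction set is a level-alpha cut, so coverage follows directly:
C_\alpha(x) = \{\, y : p_y(x) > \alpha \,\}
\;\Longrightarrow\;
\Pr\bigl(Y \in C_\alpha(X)\bigr) = 1 - \Pr\bigl(p_Y(X) \le \alpha\bigr) \ge 1 - \alpha
```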
Key takeaway
For machine learning engineers building tabular classifiers, the key point is that conformal prediction produces formally calibrated p-values as a byproduct of building prediction sets. Unlike uncalibrated `predict_proba` outputs, these scores carry a calibration guarantee, so downstream models in applications such as fraud detection or ranking receive inputs they can rely on. Consider feeding conformal p-value vectors into your pipelines to support cost-sensitive decisions and meta-models.
Key insights
Conformal classification yields both prediction sets and formally calibrated p-values from a single calibration step, giving tabular ML the calibrated class scores that `predict_proba` does not guarantee.
Principles
- Prediction sets are level-α cuts of conformal p-value vectors.
- Conformal p-values are formally calibrated, unlike `predict_proba` outputs.
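The first principle can be illustrated in a few lines of Python. This is a minimal sketch: the function name `prediction_set`, the class labels, and the p-values are all made up for illustration and do not come from the article.

```python
def prediction_set(p_values, alpha):
    """Return the labels whose conformal p-value exceeds alpha.

    p_values: dict mapping class label -> conformal p-value for one example.
    alpha:    miscoverage level; the set targets coverage of at least 1 - alpha.
    """
    return {label for label, p in p_values.items() if p > alpha}


# Hypothetical p-value vector for a single example (illustrative numbers).
example = {"legit": 0.62, "fraud": 0.08, "review": 0.03}

set_05 = prediction_set(example, alpha=0.05)  # keeps "legit" and "fraud"
set_10 = prediction_set(example, alpha=0.10)  # keeps only "legit"
```

Because every set is a cut of the same vector, the sets are nested: raising α can only remove labels. This is why shipping the p-value vector loses nothing relative to shipping a set at one fixed α.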
Method
Hold out a calibration set, compute nonconformity scores on it, then derive a conformal p-value for each class of a new example. Under exchangeability these p-values are calibrated, and they can be shipped downstream as a vector.
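The method can be sketched in plain Python as follows. The sketch assumes the standard split-conformal p-value formula, in which a candidate label's p-value is the fraction of calibration scores at least as large as its own score, with the usual +1 correction. The function name `conformal_p_values` is hypothetical, and the nonconformity scores are taken as given.

```python
from bisect import bisect_left


def conformal_p_values(cal_scores, test_scores):
    """Split-conformal p-values for one new example.

    cal_scores:  nonconformity scores of the calibration examples, each
                 computed with that example's true label.
    test_scores: one nonconformity score per candidate class for the new
                 example, in class order.
    Returns a list of p-values in the same class order.
    """
    sorted_cal = sorted(cal_scores)
    n = len(sorted_cal)
    p_values = []
    for s in test_scores:
        # Count calibration scores that are >= s (at least as nonconforming).
        n_at_least = n - bisect_left(sorted_cal, s)
        # The +1 in numerator and denominator accounts for the new example.
        p_values.append((n_at_least + 1) / (n + 1))
    return p_values


# Illustrative numbers only: four calibration scores, three candidate classes.
p_vector = conformal_p_values([0.1, 0.2, 0.3, 0.4], [0.05, 0.35, 0.9])
```

With `n` calibration examples, the smallest attainable p-value is `1 / (n + 1)`, so the size of the calibration set limits how fine-grained the scores can be.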
In practice
- Ship conformal p-value vectors to downstream tabular models.
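- Use hinge scores `1 - p_true(x_i)`, where `p_true(x_i)` is the model's predicted probability for the true class of calibration example `x_i`, as nonconformity scores in infrequent-event problems.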
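A hinge score is simple to compute from a classifier's probability outputs. The following is a minimal sketch: the helper names are hypothetical, the probabilities are illustrative, and the rows are assumed to look like the output of a `predict_proba`-style method, with one probability per class.

```python
def hinge_calibration_scores(proba, y_true):
    """Hinge score 1 - p_true(x_i) for each calibration example.

    proba:  list of per-example class-probability rows.
    y_true: integer index of the true class for each example.
    """
    return [1.0 - row[y] for row, y in zip(proba, y_true)]


def hinge_candidate_scores(proba_row):
    """Hinge score of every candidate label for one new example."""
    return [1.0 - p for p in proba_row]


# Illustrative two-class probabilities (columns: class 0, class 1).
cal_proba = [[0.9, 0.1], [0.3, 0.7], [0.8, 0.2]]
cal_labels = [0, 0, 1]

cal_scores = hinge_calibration_scores(cal_proba, cal_labels)
new_scores = hinge_candidate_scores([0.6, 0.4])
```

A confident, correct prediction gives a score near 0, while a low probability on the true class gives a score near 1. The calibration scores are what the conformal step ranks each candidate label's score against.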
Topics
- Conformal Prediction
- Calibrated Class Scores
- Prediction Sets
- Tabular Machine Learning
- Conformal p-values
Best for: Machine Learning Engineer, Data Scientist, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Valeriy’s Substack.