Ensembling Tabular Foundation Models - A Diversity Ceiling And A Calibration Trap

2026-05-18 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

Tabular Foundation Models (TFMs) are increasingly competitive with gradient-boosted trees on tabular tasks, yet no single TFM consistently outperforms the others. Ensembling is the usual response to that situation, but here it helps very little. A study benchmarking six modern TFMs and six ensemble strategies across 153 OpenML classification tasks found that the TFMs form a near-redundant pool, with a mean pairwise Q-statistic of $0.961$ (values near 1 mean the models tend to be right and wrong on the same samples). The most effective ensemble, two-level cascade stacking, improved accuracy by only $+0.18\%$ over the best single TFM, at $253\times$ the computational cost. Notably, stacking with a logistic-regression meta-learner improved accuracy by sharpening class boundaries, and the same sharpening degraded calibration: log-loss was poor even though accuracy and ROC-AUC stayed competitive.
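To make the redundancy figure concrete, here is a minimal sketch of how a mean pairwise Q-statistic can be computed from per-sample correctness flags. It uses the standard Yule's Q definition; the function names and the toy inputs are illustrative assumptions, not code or data from the study.

```python
from itertools import combinations


def q_statistic(correct_a, correct_b):
    """Yule's Q for two classifiers, given per-sample correctness flags.

    Q = (N11*N00 - N01*N10) / (N11*N00 + N01*N10), where N11 counts samples
    both models get right, N00 samples both get wrong, and N10/N01 samples
    only one of them gets right.
    """
    n11 = n00 = n10 = n01 = 0
    for a, b in zip(correct_a, correct_b):
        if a and b:
            n11 += 1
        elif not a and not b:
            n00 += 1
        elif a:
            n10 += 1
        else:
            n01 += 1
    denom = n11 * n00 + n01 * n10
    if denom == 0:
        return float("nan")  # Q is undefined for this pair
    return (n11 * n00 - n01 * n10) / denom


def mean_pairwise_q(correct_by_model):
    """Average Q over all unordered model pairs (NaN if any pair is undefined)."""
    pairs = combinations(range(len(correct_by_model)), 2)
    values = [q_statistic(correct_by_model[i], correct_by_model[j]) for i, j in pairs]
    return sum(values) / len(values)


# Toy correctness flags for three hypothetical models on six samples.
flags = [
    [1, 1, 1, 0, 1, 0],
    [1, 1, 1, 0, 1, 0],  # identical errors to the first model
    [1, 1, 0, 0, 1, 1],  # partly different errors
]
print(mean_pairwise_q(flags))
```

A pool whose mean Q sits close to 1 offers an ensemble few samples on which one member can correct another, which is the ceiling the summary describes.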

Key takeaway

AI engineers evaluating ensemble strategies for Tabular Foundation Models should prefer greedy selection to complex stacking methods. Two-level cascade stacking offers a marginal accuracy boost, but its $253\times$ compute cost is generally prohibitive. Stacking with a logistic-regression meta-learner can improve accuracy, but it severely compromises calibration, which makes it unsuitable for applications that depend on well-calibrated probabilities.

Key insights

Ensembling modern Tabular Foundation Models yields minimal gains due to high redundancy and calibration issues.

Principles

High TFM redundancy limits ensemble gains.
Sharpening class boundaries harms calibration.
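The second principle can be illustrated with a small sketch in which predicted probabilities are pushed toward 0 and 1. This is an assumption-laden toy: power-based sharpening with a hypothetical `temperature` parameter stands in for whatever the meta-learner does in the study, and the data are made up. It only shows that accuracy and log-loss can move independently.

```python
import math


def sharpen(probs, temperature):
    """Raise each probability to 1/temperature and renormalise.

    temperature < 1 makes the distribution more confident without
    changing which class has the highest probability.
    """
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


def accuracy(prob_rows, labels):
    """Fraction of samples whose highest-probability class matches the label."""
    hits = 0
    for row, y in zip(prob_rows, labels):
        predicted = max(range(len(row)), key=row.__getitem__)
        hits += predicted == y
    return hits / len(labels)


def log_loss(prob_rows, labels, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = sum(math.log(max(row[y], eps)) for row, y in zip(prob_rows, labels))
    return -total / len(labels)


# Toy case: the model says 70% for class 1 on ten samples, and seven of
# them really are class 1, so the original probabilities are calibrated.
calibrated = [[0.3, 0.7]] * 10
labels = [1] * 7 + [0] * 3
sharpened = [sharpen(row, temperature=0.25) for row in calibrated]

print("accuracy:", accuracy(calibrated, labels), accuracy(sharpened, labels))
print("log-loss:", log_loss(calibrated, labels), log_loss(sharpened, labels))
```

The predicted class never changes, so accuracy is identical, while the overconfident wrong predictions are penalised heavily and log-loss gets worse.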

Method

Benchmarked six TFMs and six ensemble strategies on 153 OpenML classification tasks, using Friedman and Nemenyi analysis to compare performance and identify equivalence groups.
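As a rough sketch of this kind of rank-based comparison, the snippet below computes average ranks per method, the Friedman chi-square statistic, and the Nemenyi critical difference. It is a plain-Python illustration under stated assumptions: no tie correction is applied to the Friedman statistic, `q_alpha` must be supplied by the caller from a studentized-range table, and the scores are invented. It is not the study's analysis code.

```python
import math


def average_ranks(scores_by_task):
    """Average rank of each method across tasks (rank 1 = highest score).

    scores_by_task is a list of rows, one row per task, one score per method.
    Tied scores within a task share the mean of the ranks they span.
    """
    k = len(scores_by_task[0])
    totals = [0.0] * k
    for row in scores_by_task:
        order = sorted(range(k), key=lambda j: -row[j])
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            shared_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1..j+1
            for pos in range(i, j + 1):
                totals[order[pos]] += shared_rank
            i = j + 1
    return [t / len(scores_by_task) for t in totals]


def friedman_chi2(avg_ranks, n_tasks):
    """Friedman chi-square statistic from average ranks (no tie correction)."""
    k = len(avg_ranks)
    sum_sq = sum(r * r for r in avg_ranks)
    return 12 * n_tasks / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4)


def nemenyi_critical_difference(q_alpha, k, n_tasks):
    """Two methods differ significantly if their average ranks differ by more."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n_tasks))


# Invented accuracies for three hypothetical methods on four tasks.
scores = [
    [0.90, 0.80, 0.70],
    [0.95, 0.85, 0.75],
    [0.60, 0.50, 0.40],
    [0.80, 0.70, 0.60],
]
ranks = average_ranks(scores)
print(ranks, friedman_chi2(ranks, n_tasks=len(scores)))
```

Methods whose average ranks fall within one critical difference of each other form the kind of equivalence group the method description refers to.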

In practice

Greedy selection is a practical default.
Avoid stacking if calibration is critical.
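The summary does not say which greedy variant the study used. The sketch below assumes forward selection with replacement on held-out predictions (in the style of Caruana-type ensemble selection), scored by validation log-loss so that calibration is part of the selection criterion. The function name, the `rounds` parameter, and the toy predictions are all hypothetical.

```python
import math


def log_loss(prob_rows, labels, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = sum(math.log(max(row[y], eps)) for row, y in zip(prob_rows, labels))
    return -total / len(labels)


def greedy_select(val_probs, labels, rounds=10):
    """Greedy forward selection with replacement over model predictions.

    val_probs maps a model name to its class-probability rows on a
    validation set. Each round adds the model that gives the lowest
    log-loss for the running average; the best prefix is kept.
    Returns (weights per selected model, validation log-loss).
    """
    names = list(val_probs)
    n_samples = len(labels)
    n_classes = len(val_probs[names[0]][0])
    running = [[0.0] * n_classes for _ in range(n_samples)]  # summed probabilities
    chosen = []
    best_loss, best_size = math.inf, 0

    for step in range(1, rounds + 1):
        losses = {}
        for name in names:
            trial = [
                [(running[i][c] + val_probs[name][i][c]) / step for c in range(n_classes)]
                for i in range(n_samples)
            ]
            losses[name] = log_loss(trial, labels)
        pick = min(names, key=losses.get)
        chosen.append(pick)
        for i in range(n_samples):
            for c in range(n_classes):
                running[i][c] += val_probs[pick][i][c]
        if losses[pick] < best_loss:
            best_loss, best_size = losses[pick], step

    selected = chosen[:best_size]
    weights = {name: selected.count(name) / best_size for name in set(selected)}
    return weights, best_loss


# Toy validation predictions for two hypothetical models.
labels = [0, 1, 0, 1]
val_probs = {
    "model_a": [[0.9, 0.1], [0.1, 0.9], [0.9, 0.1], [0.1, 0.9]],
    "model_b": [[0.4, 0.6], [0.6, 0.4], [0.4, 0.6], [0.6, 0.4]],
}
print(greedy_select(val_probs, labels, rounds=5))
```

Because the result is a weighted average of member probabilities rather than a learned meta-model, this kind of selection needs only cached validation predictions and does not sharpen the outputs beyond what the members already predict.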

Topics

Tabular Foundation Models
Ensemble Learning
Model Diversity
Model Calibration
Stacking Ensembles

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.