Pocket Foundation Models: Distilling TFMs into CPU-Ready Gradient-Boosted Trees
Summary
Researchers have developed "Pocket Foundation Models" by distilling large Tabular Foundation Models (TFMs) into CPU-ready gradient-boosted trees like XGBoost or CatBoost, addressing the critical need for sub-2ms inference times in applications such as fraud scoring. The primary challenge of label leakage during in-context learning (ICL) in TFMs was overcome using stratified out-of-fold (OOF) teacher labeling. This method enabled the distillation of TabICLv2 into XGBoost, achieving a 0.882 macro-mean AUC (96.5% of the teacher's AUC) with an inference time of 1.9 ms on CPU. This represents a 38x to 860x speedup compared to GPU-based TFMs. The full distillation pipeline is open-sourced as part of the TabTune library.
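As a quick sanity check on the figures above, the short calculation below backs out what they imply. It assumes the 96.5% retention applies to the same macro-mean AUC and that the 38x to 860x range is measured against the 1.9 ms student latency; neither derived number is stated in the summary itself.

```python
# Reported figures from the summary.
student_auc = 0.882        # macro-mean AUC of the distilled XGBoost student
retention = 0.965          # student AUC as a fraction of the teacher's AUC
student_latency_ms = 1.9   # CPU inference time of the student
speedup_low, speedup_high = 38, 860

# Derived values (assumptions, not reported numbers).
implied_teacher_auc = round(student_auc / retention, 3)
implied_gpu_latency_ms = (
    round(student_latency_ms * speedup_low, 1),
    round(student_latency_ms * speedup_high, 1),
)

print(implied_teacher_auc)      # roughly 0.914
print(implied_gpu_latency_ms)   # roughly 72 ms to 1.6 s
```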
Key takeaway
Research scientists building latency-sensitive tabular models should consider distilling TFMs into CPU-native gradient-boosted trees. The approach is most effective on low-dimensional data, where it delivers large speedups while retaining most of the teacher's accuracy. Be cautious on high-dimensional tasks where the teacher itself underperforms, as distillation may worsen results.
Key insights
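Distilling large tabular foundation models into CPU-native gradient-boosted trees sharply reduces inference latency: in the reported TabICLv2-to-XGBoost result, the student runs in 1.9 ms on CPU, a 38x to 860x speedup over GPU-based TFMs, while retaining 96.5% of the teacher's AUC.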
Principles
- Teacher rank transfers exactly to student rank: a stronger teacher yields a stronger distilled student.
- Distillation gains concentrate on low-dimensional data.
- Multi-teacher averaging helps MLP students more than tree students.
Method
Distill in-context learning (ICL) teachers into XGBoost or CatBoost students, preventing label leakage by using stratified out-of-fold (OOF) teacher labeling to generate soft targets.
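The OOF labeling step can be sketched as follows. This is a minimal illustration, not the TabTune implementation: `stratified_folds`, `knn_teacher`, and `oof_soft_targets` are hypothetical names, the data is synthetic, and a small nearest-neighbour predictor stands in for the ICL teacher because, like a TFM, it predicts from a labeled context set and therefore leaks labels when asked to score rows that are in its own context.

```python
import numpy as np

def stratified_folds(y, n_splits, seed=0):
    """Assign each row a fold id so that every fold keeps the class balance."""
    rng = np.random.default_rng(seed)
    fold = np.empty(len(y), dtype=int)
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        fold[idx] = np.arange(len(idx)) % n_splits  # deal rows round-robin
    return fold

def knn_teacher(X_ctx, y_ctx, X_query, k=5):
    """Hypothetical stand-in for an ICL teacher: P(y=1) for each query row
    is the mean label of its k nearest context rows."""
    dist = ((X_query[:, None, :] - X_ctx[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(dist, axis=1)[:, :k]
    return y_ctx[nearest].mean(axis=1)

def oof_soft_targets(X, y, teacher, n_splits=5, seed=0):
    """Label every row with a teacher whose context excludes that row's fold."""
    fold = stratified_folds(y, n_splits, seed)
    soft = np.empty(len(y), dtype=float)
    for f in range(n_splits):
        held_out = fold == f
        soft[held_out] = teacher(X[~held_out], y[~held_out], X[held_out])
    return soft

# Synthetic binary task, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Leaky labeling: each row sits in the teacher's context and finds itself.
leaky = knn_teacher(X, y, X, k=1)

# OOF labeling: the teacher never sees the label of the row it is scoring.
soft = oof_soft_targets(X, y, knn_teacher)
```

In the leaky variant the "soft" targets are just the hard labels copied back, so the student would learn nothing from the teacher beyond the original labels. The OOF targets in `soft` are what a student such as XGBoost or CatBoost would then be fit against.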
In practice
- Use stratified OOF labeling for ICL teacher distillation.
- Prioritize distillation for low-dimensional tabular datasets.
- Consider multi-teacher averaging for MLP-based student models.
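Multi-teacher averaging, mentioned in the last item above, amounts to blending the soft targets from several teachers before fitting the student. A minimal sketch, assuming each teacher has already produced an OOF probability per row; `average_teachers` and the example values are hypothetical, and the optional weighting is an illustration rather than something the summary describes.

```python
import numpy as np

def average_teachers(soft_targets, weights=None):
    """Combine per-teacher P(y=1) vectors into one soft target per row."""
    stacked = np.stack(soft_targets)  # shape: (n_teachers, n_rows)
    return np.average(stacked, axis=0, weights=weights)

# Made-up OOF probabilities from two teachers on four rows.
teacher_a = np.array([0.90, 0.20, 0.60, 0.10])
teacher_b = np.array([0.70, 0.40, 0.50, 0.30])

blended = average_teachers([teacher_a, teacher_b])
```

The blended vector replaces a single teacher's soft targets as the student's regression target.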
Topics
- Pocket Foundation Models
- Tabular Foundation Models
- Knowledge Distillation
- Gradient-Boosted Trees
- In-Context Learning
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.