Pocket Foundation Models: Distilling TFMs into CPU-Ready Gradient-Boosted Trees

2026-05-18 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Software Development & Engineering · Depth: Advanced, quick

Summary

Researchers have developed "Pocket Foundation Models" by distilling large Tabular Foundation Models (TFMs) into CPU-ready gradient-boosted trees such as XGBoost or CatBoost, targeting the sub-2 ms inference latency that applications such as fraud scoring require. The main obstacle is label leakage: an in-context learning (ICL) teacher that labels rows already present in its own context can simply echo their labels. The authors avoid this with stratified out-of-fold (OOF) teacher labeling. With this method, TabICLv2 distilled into XGBoost reaches a 0.882 macro-mean AUC (96.5% of the teacher's AUC) at an inference time of 1.9 ms on CPU, a 38x to 860x speedup over GPU-based TFMs. The full distillation pipeline is open-sourced as part of the TabTune library.
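The headline figures can be cross-checked with a little arithmetic. The sketch below is a back-of-the-envelope calculation, not a result from the article: it assumes the 96.5% retention refers to the same macro-mean AUC and that the speedup range is measured against the 1.9 ms student latency. Under those assumptions the implied teacher AUC is roughly 0.914 and the implied GPU teacher latencies run from roughly 72 ms to 1.6 s.

```python
# Figures quoted in the summary.
student_auc = 0.882          # macro-mean AUC of the distilled XGBoost student
retention = 0.965            # student AUC as a fraction of the teacher's AUC
student_latency_ms = 1.9     # CPU inference time of the student
speedup_low, speedup_high = 38, 860

# Assumption: retention is student_auc / teacher_auc on the same metric.
implied_teacher_auc = student_auc / retention

# Assumption: speedups are relative to the 1.9 ms student latency.
implied_teacher_latency_ms = (
    student_latency_ms * speedup_low,
    student_latency_ms * speedup_high,
)

print(f"implied teacher AUC: {implied_teacher_auc:.3f}")
print(
    "implied teacher latency: "
    f"{implied_teacher_latency_ms[0]:.0f} ms to {implied_teacher_latency_ms[1]:.0f} ms"
)
```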

Key takeaway

Research scientists developing high-performance tabular models should investigate distilling TFMs into CPU-native gradient-boosted trees to meet stringent latency requirements. This approach is particularly effective for low-dimensional data, where it can deliver large speedups while retaining most of the teacher's accuracy. Be cautious on high-dimensional tasks where the teacher itself underperforms, as distillation may worsen results.

Key insights

Distilling large tabular foundation models into CPU-native gradient-boosted trees significantly reduces inference latency.

Principles

Teacher rank transfers exactly to student rank.
Distillation gains concentrate on low-dimensional data.
Multi-teacher averaging helps MLP students more than tree students.

Method

Distill in-context learning (ICL) teachers into XGBoost or CatBoost students, preventing label leakage by using stratified out-of-fold (OOF) teacher labeling to generate soft targets.
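The labeling step can be illustrated with a minimal, standard-library sketch. It is not the TabTune pipeline: `knn_teacher` is a hypothetical stand-in for an ICL teacher such as TabICLv2, and `stratified_folds` and `oof_soft_targets` are illustrative names. The point it shows is that each row's soft target comes from a teacher whose context excludes that row's fold.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign each row to a fold, keeping class counts balanced across folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    fold_of = [0] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            fold_of[idx] = pos % n_folds  # round-robin within each class
    return fold_of


def knn_teacher(context_X, context_y, query_X, k=3):
    """Hypothetical ICL-style teacher: P(y=1) from the k nearest context rows."""
    probs = []
    for q in query_X:
        ranked = sorted(
            range(len(context_X)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(context_X[i], q)),
        )
        nearest = ranked[:k]
        probs.append(sum(context_y[i] for i in nearest) / len(nearest))
    return probs


def oof_soft_targets(X, y, teacher, n_folds=5, seed=0):
    """Label every row with a teacher that never saw that row's label."""
    fold_of = stratified_folds(y, n_folds, seed)
    soft = [None] * len(X)
    for fold in range(n_folds):
        held_out = [i for i, f in enumerate(fold_of) if f == fold]
        context = [i for i, f in enumerate(fold_of) if f != fold]
        preds = teacher(
            [X[i] for i in context],
            [y[i] for i in context],
            [X[i] for i in held_out],
        )
        for i, p in zip(held_out, preds):
            soft[i] = p
    return soft


# Toy binary dataset (synthetic, for illustration only).
rng = random.Random(42)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(60)]
y = [1 if row[0] + row[1] > 0 else 0 for row in X]

soft_targets = oof_soft_targets(X, y, knn_teacher)

# Leakage check: labeling rows that sit in the teacher's own context
# just echoes the hard labels back when k=1.
leaky_targets = knn_teacher(X, y, X, k=1)
```

The resulting `soft_targets` would then serve as the regression targets for the XGBoost or CatBoost student. The `leaky_targets` line shows why the out-of-fold split matters: with the query rows inside the context, this toy teacher reproduces the hard labels and carries no extra information for the student to learn from.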

In practice

Use stratified OOF labeling for ICL teacher distillation.
Prioritize distillation for low-dimensional tabular datasets.
Consider multi-teacher averaging for MLP-based student models.
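Multi-teacher averaging can be sketched in a few lines. The source does not say how the teachers are combined, so this example assumes a plain (optionally weighted) arithmetic mean of per-row probabilities; `average_teachers` and the probability values are hypothetical.

```python
def average_teachers(teacher_probs, weights=None):
    """Combine per-row probabilities from several teachers into one soft target.

    teacher_probs: list of lists, one list of P(y=1) values per teacher.
    weights: optional per-teacher weights; defaults to a uniform average.
    """
    n_teachers = len(teacher_probs)
    if n_teachers == 0:
        raise ValueError("need at least one teacher")
    n_rows = len(teacher_probs[0])
    if any(len(probs) != n_rows for probs in teacher_probs):
        raise ValueError("all teachers must label the same rows")
    if weights is None:
        weights = [1.0] * n_teachers
    total = sum(weights)
    return [
        sum(w * probs[row] for w, probs in zip(weights, teacher_probs)) / total
        for row in range(n_rows)
    ]


# Hypothetical OOF probabilities from two teachers on three rows.
teacher_a = [0.9, 0.2, 0.6]
teacher_b = [0.7, 0.4, 0.5]

averaged = average_teachers([teacher_a, teacher_b])
```

Each teacher's probabilities should themselves be produced out-of-fold before averaging, so the combined soft target stays free of label leakage.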

Topics

Pocket Foundation Models
Tabular Foundation Models
Knowledge Distillation
Gradient-Boosted Trees
In-Context Learning

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.