LLMs on Tabular Data with Limited Semantics: Evidence from Industrial Car Retrofit Prediction

2026-06-16 · Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, long

Summary

This study investigates how Large Language Models (LLMs) perform on structured industrial data, using car retrofit prediction at BMW Group as the test case. The authors analyzed a dataset of 284,271 prototype vehicles and 48,716 retrofit visits in which categorical values were hashed, removing semantic cues. They compared strong tabular machine learning baselines with three LLM strategies: embedding features (Amazon Titan Embed v2), direct prompted classification (Claude Sonnet 4), and an ML+LLM stacking approach. Classical tree ensembles remained the strongest standalone models, but LLM embeddings proved useful as features (binary AUC = 0.982). Direct prompting, by contrast, collapsed to chance-level performance (binary AUC = 0.500) because the hashed inputs carry no semantic signal. The hybrid stacking model achieved the best manually built multiclass performance (weighted F1 = 0.626), suggesting that LLMs are more effective as complementary components than as standalone replacements for robust tabular baselines.

Key takeaway

Machine learning engineers building predictive systems on privacy-constrained industrial tabular data should prioritize robust classical models such as gradient-boosted trees. Direct LLM prompting is ineffective when the data carries no semantic content, but LLM embeddings or outputs can still be integrated into a hybrid stacking architecture. That approach can supply complementary signal, as the multiclass weighted F1 of 0.626 shows, and it improves the overall system without the cost and latency of replacing the tabular model with an LLM outright.

Key insights

LLMs complement, but do not replace, strong tabular models on privacy-constrained industrial data lacking semantic cues.

Principles

Hashed categorical data severely degrades direct LLM prompting.
LLM embeddings can capture useful structural patterns in tabular data.
Hybrid ML+LLM stacking can exploit complementary error patterns.
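
To make the first principle concrete, here is a minimal sketch of what hashing does to a categorical value. The summary does not describe the hashing scheme actually used, so the salted SHA-256 truncation, the function name `hash_category`, and the column names and values are all illustrative assumptions.

```python
import hashlib


def hash_category(value: str, salt: str = "demo-salt", length: int = 12) -> str:
    """Map a categorical value to an opaque, deterministic token.

    Assumption: salted SHA-256, truncated. The study's real scheme is not
    described in this summary.
    """
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:length]


# Hypothetical row; these column names and values are invented for illustration.
row = {"body_type": "convertible", "drivetrain": "all-wheel drive"}
hashed_row = {key: hash_category(val) for key, val in row.items()}

# Equal inputs still map to equal tokens, so a tree model can keep splitting
# on them, but the words an LLM could reason about are gone.
print(hashed_row)
```

The token preserves identity (the same category always yields the same string) but not meaning, which is consistent with embeddings and tree models still finding structure while direct prompting has nothing to interpret.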

Method

Serialize each structured row into key-value text, then either embed that text with an LLM or pass it to an LLM for direct prompted classification. Finally, combine the LLM outputs with classical models via stacking.
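
The serialization step might look like the sketch below. The `key: value` template, the `; ` separator, and the example column names are assumptions, since the summary does not give the exact format. The call to an embedding model is deliberately left out rather than guessed.

```python
def serialize_row(row: dict) -> str:
    """Turn one tabular row into a single key-value string.

    Assumption: a "key: value; key: value" template. The study's actual
    serialization format is not given in this summary.
    """
    parts = [f"{key}: {value}" for key, value in row.items()]
    return "; ".join(parts)


# Hypothetical row with already-hashed categorical values.
example_row = {"model_series": "3fa9c1", "plant": "b71e02", "mileage_km": 18250}
text = serialize_row(example_row)
print(text)  # model_series: 3fa9c1; plant: b71e02; mileage_km: 18250
```

The resulting string is what would be sent either to an embedding model, to obtain a feature vector, or into a classification prompt.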

In practice

Use LLM embeddings as features for classical models.
Integrate LLM outputs as meta-features in stacking ensembles.
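
The stacking idea can be sketched with nothing but the standard library. Everything here is a toy: the eight labels and both probability columns are made-up numbers chosen so that the two base models err on different rows, and the plain gradient-descent logistic regression stands in for whatever meta-learner the study used. None of it reproduces the paper's pipeline or results.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=1.0, epochs=10000):
    """Batch gradient descent on mean log-loss; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(probs, labels):
    hits = sum((p > 0.5) == bool(y) for p, y in zip(probs, labels))
    return hits / len(labels)


# Synthetic toy data (NOT from the study): 1 = retrofit, 0 = no retrofit.
y = [1, 1, 1, 1, 0, 0, 0, 0]
p_tree = [0.9, 0.8, 0.4, 0.7, 0.1, 0.2, 0.6, 0.3]  # tree-ensemble scores
p_emb = [0.6, 0.4, 0.9, 0.7, 0.4, 0.6, 0.1, 0.3]   # embedding-model scores

# Meta-features: one column per base model. In a real pipeline these must be
# out-of-fold predictions, otherwise the meta-learner sees leaked labels.
meta_X = [[t, e] for t, e in zip(p_tree, p_emb)]
w, b = fit_logistic(meta_X, y)
p_stack = [sigmoid(w[0] * t + w[1] * e + b) for t, e in meta_X]

print(accuracy(p_tree, y), accuracy(p_emb, y), accuracy(p_stack, y))
```

On this toy set each base model gets six of eight rows right, on different rows, and the stacked model separates all eight. Scoring on the training rows only shows the mechanics; it says nothing about how large the gain is on real data.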

Topics

Large Language Models
Tabular Data
Industrial Prediction
Retrofit Planning
Hybrid ML Systems
Time Series Forecasting

Code references

aina-vila-pons/retrofit-forecast-pipeline

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.