An Information-Theoretic Criterion for Efficient Data Synthesis

Source: cs.LG updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Mathematics & Computational Sciences · Depth: Expert, extended

Summary

Hanyu Li, Zhengqi Sun, and Xiaotie Deng introduce an information-theoretic framework to explain why synthetic data helps large language model (LLM) training in some settings and fails in others. They propose that synthetic data improves a model only when the generation-training loop is "information-open," meaning it incorporates external signals, such as verifiers or rubrics, that inject task-relevant information beyond the model's current distribution. "Information-closed" loops, which rely solely on the model's own outputs, are bound by the data processing inequality: processing cannot add task-relevant information, so that information is at best preserved and in practice lost, which leads to distribution collapse. The authors also argue that efficiency and generalization in information-open pipelines depend on the "meta-level" of supervision. Coarser, higher meta-level signals, such as binary correctness, generalize better because they encode task-relevant distinctions rather than specific surface forms. From this they derive the thesis that learning prioritizes the most information-efficient component of the signal, which can accelerate learning when that component is task-relevant, or cause reward hacking when a spurious pattern is simpler to pick up.

Key takeaway

For AI engineers designing LLM training pipelines, the information-theoretic properties of synthetic data determine whether it helps at all. Integrate robust external verification signals so that the loop stays "information-open" and the model does not collapse. Prefer high meta-level supervision, such as binary correctness checks, to maximize sample efficiency and cross-domain generalization, because it concentrates learning on invariant task criteria rather than specific output forms. Be wary of spurious signals that are cheaper to learn than the intended one, since they can lead to reward hacking.
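As an illustration of the takeaway above, the following is a minimal sketch of an information-open data loop on a toy addition task. Everything in it is an assumption made for illustration and is not taken from the paper: `propose_answer` is a hypothetical stand-in for a model that is sometimes off by one, and `verifier` is a hypothetical external binary correctness check. Only samples that pass the check are kept for training.

```python
import random


def propose_answer(a, b, rng):
    """Hypothetical stand-in for a model: returns a + b, sometimes off by one."""
    noise = rng.choice([0, 0, 0, 1, -1])
    return a + b + noise


def verifier(a, b, answer):
    """Hypothetical external binary check (coarse, high meta-level signal)."""
    return answer == a + b


def build_dataset(n_prompts, samples_per_prompt, seed=0):
    """Sample candidate answers and keep only those the verifier accepts."""
    rng = random.Random(seed)
    kept, dropped = [], 0
    for _ in range(n_prompts):
        a, b = rng.randint(0, 99), rng.randint(0, 99)
        for _ in range(samples_per_prompt):
            answer = propose_answer(a, b, rng)
            if verifier(a, b, answer):
                kept.append((a, b, answer))
            else:
                dropped += 1
    return kept, dropped


kept, dropped = build_dataset(n_prompts=50, samples_per_prompt=4)
print(f"kept {len(kept)} verified samples, dropped {dropped}")
```

Without the `verifier` call, the retained data would simply mirror the generator's own error pattern, which is the information-closed case. The verifier here only says pass or fail and never supplies the correct surface form, which is the sense in which the signal is coarse.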

Key insights

Synthetic data is effective only when external signals keep the training loop "information-open."

Method

The paper formalizes training as a Markov chain $X\to D\to Z$, for which the data processing inequality gives $I(X;Z)\leq I(X;D)$. It then refines this bound by introducing an external signal $S$, leading to $I(X;Z)\leq I(X;D,S)$.
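The two bounds can be checked numerically on a small discrete example. The sketch below is not the paper's construction. It assumes, for illustration only, that $X$ is a uniform task-relevant bit, $D$ is a noisy copy of $X$ (the model's own data), $S$ is a second, less noisy copy (an external verifier signal), and $Z$ is what the trained model ends up encoding. The flip probabilities are arbitrary. By the chain rule, $I(X;D,S)=I(X;D)+I(X;S\mid D)$, so the external signal can only raise the ceiling on $I(X;Z)$.

```python
import math
from collections import defaultdict


def mutual_information(joint):
    """I(A;B) in bits from a dict {(a, b): probability}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(
        p * math.log2(p / (pa[a] * pb[b]))
        for (a, b), p in joint.items()
        if p > 0
    )


def bsc(bit, flip):
    """Binary symmetric channel: maps an input bit to {output: probability}."""
    return {bit: 1 - flip, 1 - bit: flip}


# Joint distribution over (x, d, s): D and S are independent noisy views of X.
joint_xds = {}
for x in (0, 1):
    for d, p_d in bsc(x, 0.3).items():      # assumed noise level of the data
        for s, p_s in bsc(x, 0.1).items():  # assumed noise level of the signal
            joint_xds[(x, d, s)] = 0.5 * p_d * p_s


def joint_with(f):
    """Joint of (x, f(d, s)) obtained by marginalizing joint_xds."""
    out = defaultdict(float)
    for (x, d, s), p in joint_xds.items():
        out[(x, f(d, s))] += p
    return dict(out)


i_x_d = mutual_information(joint_with(lambda d, s: d))
i_x_ds = mutual_information(joint_with(lambda d, s: (d, s)))

# Closed loop: Z is a further noisy function of D alone (X -> D -> Z).
joint_closed = defaultdict(float)
for x in (0, 1):
    for d, p_d in bsc(x, 0.3).items():
        for z, p_z in bsc(d, 0.1).items():
            joint_closed[(x, z)] += 0.5 * p_d * p_z
i_closed = mutual_information(dict(joint_closed))

# Open loop: Z may also use S; here the model simply adopts the signal.
i_open = mutual_information(joint_with(lambda d, s: s))

print(f"I(X;D)   = {i_x_d:.4f} bits")
print(f"I(X;D,S) = {i_x_ds:.4f} bits")
print(f"closed-loop I(X;Z) = {i_closed:.4f} bits")
print(f"open-loop   I(X;Z) = {i_open:.4f} bits")
```

In this toy setting the closed-loop value stays below $I(X;D)$, while the open-loop value exceeds $I(X;D)$ yet remains below $I(X;D,S)$, matching the two inequalities.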

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.