An Information-Theoretic Criterion for Efficient Data Synthesis
Summary
Hanyu Li, Zhengqi Sun, and Xiaotie Deng introduce an information-theoretic framework to explain the inconsistent effectiveness of synthetic data in large language model (LLM) training. They propose that synthetic data improves models only when the generation-training loop is "information-open," meaning it incorporates external signals like verifiers or rubrics that inject task-relevant information beyond the model's current distribution. Conversely, "information-closed" loops, which rely solely on the model's own outputs, lead to distribution collapse. The reason is the data processing inequality: processing the model's own outputs cannot add task-relevant information, so that information can only stay the same or shrink from one round to the next. The authors also argue that efficiency and generalization in information-open pipelines depend on the "meta-level" of supervision; coarser signals, such as binary correctness, generalize better because they focus on task-relevant distinctions rather than specific surface forms. This leads to the thesis that learning prioritizes the most information-efficient component of the signal, which can accelerate learning when that component is the intended one, or cause reward hacking when a spurious pattern is simpler to pick up.
Key takeaway
For AI Engineers designing LLM training pipelines, understanding the information-theoretic properties of synthetic data is crucial. You should prioritize integrating robust, external verification signals to maintain an "information-open" loop, preventing model collapse. Focus on high meta-level supervision, such as binary correctness checks, to maximize sample efficiency and cross-domain generalization, as this approach concentrates learning on invariant task criteria rather than specific output forms. Be wary of spurious, high-efficiency signals that can lead to reward hacking.
Key insights
Synthetic data is effective only when external signals keep the training loop "information-open."
Principles
- Information-closed loops lead to model collapse.
- Higher meta-level supervision improves efficiency and generalization.
- Learning converges to the most information-efficient signal.
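The first principle can be made concrete with a toy simulation. The sketch below is not taken from the paper: it assumes a hypothetical "model" that is just a categorical distribution over a fixed set of outputs, and it refits that distribution each round using only samples drawn from the previous round. The function name `closed_loop_support` and all of its parameters are illustrative. Because no external signal enters the loop, an output that drops out of the sample can never come back, so the number of distinct outputs can only stay the same or shrink.

```python
import random
from collections import Counter


def closed_loop_support(num_symbols=50, samples_per_round=100, rounds=30, seed=0):
    """Toy information-closed loop (illustrative, not the paper's experiment).

    Each round refits a categorical model to samples drawn only from the
    previous round's model, and records how many distinct outputs survive.
    """
    rng = random.Random(seed)
    # Round 0: a uniform model over num_symbols distinct outputs.
    population = list(range(num_symbols))
    weights = [1.0] * num_symbols
    support_sizes = [num_symbols]
    for _ in range(rounds):
        # Generate synthetic data from the current model only.
        draws = rng.choices(population, weights=weights, k=samples_per_round)
        # "Retrain": the next model is the empirical distribution of the draws.
        counts = Counter(draws)
        population = list(counts.keys())
        weights = [counts[s] for s in population]
        support_sizes.append(len(population))
    return support_sizes


if __name__ == "__main__":
    sizes = closed_loop_support()
    print("distinct outputs per round:", sizes)
```

Shrinking support is only a rough proxy for the loss of task-relevant information that the paper describes, but it shows the same one-way behavior: without an outside signal, nothing in the loop can restore what a round of sampling has lost.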
Method
The paper formalizes training as a Markov chain $X\to D\to Z$ and refines the Data Processing Inequality by introducing an external signal $S$, leading to $I(X;Z)\leq I(X;D,S)$.
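For reference, the two bounds can be written side by side, keeping the summary's symbols. This is a sketch under one stated assumption, namely that the external signal $S$ enters alongside $D$ so that $Z$ depends on $X$ only through the pair $(D,S)$; the paper's exact conditions may differ. The first line is the standard data processing inequality for the closed chain, and the expansion in the second line is the chain rule for mutual information.

```latex
\begin{align}
  % Information-closed loop: X -> D -> Z
  I(X;Z) &\le I(X;D) \\
  % Information-open loop (assumed chain: X -> (D,S) -> Z)
  I(X;Z) &\le I(X;D,S) = I(X;D) + I(X;S \mid D)
\end{align}
```

The extra term $I(X;S \mid D)$ is the information about $X$ that $S$ carries beyond what $D$ already contains. When it is zero, the open-loop bound reduces to the closed-loop one.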
In practice
- Prioritize external verifiers for synthetic data pipelines.
- Design supervision signals at a high meta-level (e.g., binary correctness).
- Prioritize prompt diversity over raw data volume to improve generalization.
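The first two recommendations can be sketched as a verifier-filtered generation step. Everything in this example is a hypothetical stand-in rather than the paper's pipeline: `noisy_generator` plays the role of a model that sometimes answers an addition prompt incorrectly, `verifier` is an external binary correctness check, and `build_verified_dataset` keeps only the candidates that pass it. The verifier judges whether the answer is right, not how it is written, which is what the summary means by high meta-level supervision.

```python
import random


def noisy_generator(a, b, rng, error_rate=0.4):
    """Hypothetical stand-in for a model: returns a + b, sometimes off by one."""
    answer = a + b
    if rng.random() < error_rate:
        answer += rng.choice([-1, 1])
    return answer


def verifier(a, b, answer):
    """External binary signal: checks correctness only, not surface form."""
    return answer == a + b


def build_verified_dataset(num_prompts=200, seed=0):
    """Generate one candidate per prompt and keep only verified ones."""
    rng = random.Random(seed)
    kept = []
    for _ in range(num_prompts):
        # Sample prompts from a wide range so the kept data stays diverse.
        a, b = rng.randint(0, 99), rng.randint(0, 99)
        candidate = noisy_generator(a, b, rng)
        if verifier(a, b, candidate):
            kept.append((a, b, candidate))
    return kept


if __name__ == "__main__":
    data = build_verified_dataset()
    print(f"kept {len(data)} of 200 candidates")
```

In a real pipeline the verifier might be a unit-test runner, a proof checker, or a rubric-based grader. The point of the sketch is only that the filter brings in information the generator does not have.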
Topics
- Synthetic Data Effectiveness
- Information-Theoretic Criterion
- Data Processing Inequality
- Information-Open Loops
- Meta-Level Supervision
Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.