The Quality-Utility Paradox: Why High-Reward Data Impairs Small Model Mathematical Reasoning
Summary
Research identifies a "Quality-Utility Paradox" in distilling mathematical reasoning knowledge into Small Language Models (SLMs). Contrary to the common assumption that higher-scoring data makes better training data, traces refined or synthesized by a stronger Oracle model score higher with reward models yet consistently underperform, as distillation data, traces generated by the SLM itself and selected via rejection sampling. This phenomenon was observed across the Qwen2.5, LLaMA-3, and DeepSeek model families. Analysis indicates that Oracle refinement repairs the logic of a solution but also introduces a distributional drift away from the SLM's native reasoning patterns. This drift increases the SLM's adaptation cost, negating the benefit of the improved logic. To address this, the study introduces "Style-Aligned Refinement," a method that preserves the SLM's native trajectory while incorporating the Oracle's logical repairs, thereby lowering adaptation cost and restoring utility. These findings emphasize the need to optimize both solution quality and learner-data compatibility in mathematical reasoning distillation.
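The summary refers to SLM-generated traces "selected via rejection sampling" without spelling out the procedure. A minimal sketch of one common form is shown here, assuming each trace ends with an `Answer:` line and that a reference answer is available. The names `extract_final_answer`, `rejection_sample`, and `make_toy_sampler` are hypothetical, and the canned sampler stands in for a real SLM call.

```python
import itertools
import re
from typing import Callable, List, Optional


def extract_final_answer(trace: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker, or None if absent."""
    matches = re.findall(r"Answer:\s*(.+)", trace)
    return matches[-1].strip() if matches else None


def rejection_sample(
    problem: str,
    reference_answer: str,
    sample_fn: Callable[[str], str],
    num_samples: int = 8,
) -> List[str]:
    """Draw traces from the learner; keep only those with the correct final answer."""
    kept = []
    for _ in range(num_samples):
        trace = sample_fn(problem)
        if extract_final_answer(trace) == reference_answer:
            kept.append(trace)
    return kept


def make_toy_sampler(canned_traces: List[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for an SLM: cycles through fixed traces."""
    cycle = itertools.cycle(canned_traces)
    return lambda problem: next(cycle)


canned = [
    "2 + 2 = 5. Answer: 5",             # wrong, will be rejected
    "2 + 2 = 4. Answer: 4",             # correct, kept
    "Two plus two is four. Answer: 4",  # correct, kept
]
kept_traces = rejection_sample(
    "What is 2 + 2?", "4", make_toy_sampler(canned), num_samples=6
)
```

Because the kept traces are the learner's own outputs, they stay inside its native distribution by construction. The filter only removes incorrect solutions.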
Key takeaway
For Machine Learning Engineers developing Small Language Models for mathematical reasoning, relying solely on high-reward Oracle-refined data for knowledge distillation can be counterproductive. You should prioritize data compatibility with your SLM's native reasoning distribution, even if it means using SLM-generated data selected via rejection sampling. Consider implementing "Style-Aligned Refinement" to preserve your model's intrinsic style while benefiting from Oracle logic, ensuring better downstream utility and avoiding increased adaptation costs.
Key insights
High-quality Oracle data can impair SLM mathematical reasoning due to distributional drift, despite high reward scores.
Principles
- Reward model scores alone are insufficient for distillation.
- Oracle refinement causes distributional drift.
- Learner-data compatibility is crucial for utility.
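The principles above can be illustrated by ranking the same candidates two ways. This sketch assumes that mean per-token negative log-likelihood under the learner is a usable proxy for adaptation cost. The summary does not say how the study measures it, so the proxy, the function `mean_nll`, and all numbers below are illustrative assumptions rather than reported results.

```python
from typing import List


def mean_nll(token_logprobs: List[float]) -> float:
    """Mean per-token negative log-likelihood (lower = closer to the learner)."""
    if not token_logprobs:
        raise ValueError("token_logprobs must be non-empty")
    return -sum(token_logprobs) / len(token_logprobs)


# Made-up values: 'reward' is a reward-model score, 'learner_logprobs' are the
# learner's log-probabilities for each token of the candidate trace.
candidates = [
    {"source": "oracle_refined", "reward": 0.95,
     "learner_logprobs": [-1.5, -0.9, -1.2]},
    {"source": "slm_rejection_sampled", "reward": 0.80,
     "learner_logprobs": [-0.2, -0.1, -0.3]},
]

# Ranking by reward alone and ranking by compatibility can disagree.
best_by_reward = max(candidates, key=lambda c: c["reward"])
best_by_compat = min(candidates, key=lambda c: mean_nll(c["learner_logprobs"]))
```

In this toy setup the reward model prefers the Oracle-refined trace while the compatibility proxy prefers the learner's own trace, which is the kind of disagreement the paradox describes.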
Method
Style-Aligned Refinement preserves an SLM's native reasoning trajectory while integrating logical repairs from a stronger Oracle, reducing adaptation cost and improving utility.
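The summary gives no implementation details for Style-Aligned Refinement, so the following is only a sketch of the idea under strong assumptions: the trace is already split into steps, and the Oracle returns step-level corrections keyed by index. The functions `style_aligned_refine` and `overlap_with_native` are hypothetical, and character overlap is a crude stand-in for similarity to the learner's style.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def style_aligned_refine(native_steps: List[str], repairs: Dict[int, str]) -> List[str]:
    """Copy the learner's steps, replacing only the indices the Oracle repaired."""
    for idx in repairs:
        if not 0 <= idx < len(native_steps):
            raise IndexError(f"repair index {idx} is outside the trace")
    return [repairs.get(i, step) for i, step in enumerate(native_steps)]


def overlap_with_native(native_steps: List[str], candidate_steps: List[str]) -> float:
    """Character-level similarity in [0, 1] between a candidate and the native trace."""
    return SequenceMatcher(
        None, "\n".join(native_steps), "\n".join(candidate_steps)
    ).ratio()


native = [
    "Let x be the number of apples.",
    "Then 3x = 12, so x = 3.",  # arithmetic slip
    "Answer: 3",
]
# Hypothetical Oracle output: minimal corrections to the faulty steps only.
repairs = {1: "Then 3x = 12, so x = 4.", 2: "Answer: 4"}
refined = style_aligned_refine(native, repairs)

# A full Oracle rewrite is also correct, but phrased in the Oracle's own style.
full_rewrite = [
    "We are asked to solve a linear equation in one variable.",
    "Dividing both sides of 3x = 12 by 3 yields x = 4.",
    "Therefore the final answer is 4.",
]
```

Comparing `overlap_with_native(native, refined)` with `overlap_with_native(native, full_rewrite)` shows the intended trade-off: both candidates fix the error, but the minimally repaired trace stays much closer to what the learner originally wrote.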
In practice
- Prioritize learner-data compatibility in distillation.
- Consider SLM-generated data selected via rejection sampling over Oracle-refined data.
- Implement style-aligned refinement techniques.
Topics
- Small Language Models
- Knowledge Distillation
- Mathematical Reasoning
- Reward Models
- Distributional Drift
- Style-Aligned Refinement
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.