The Quality-Utility Paradox: Why High-Reward Data Impairs Small Model Mathematical Reasoning

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

The research identifies a "Quality-Utility Paradox" in distilling mathematical reasoning into Small Language Models (SLMs). Contrary to the common assumption that higher-reward training data produces better students, data refined or synthesized by a stronger Oracle model scores higher under reward models yet consistently underperforms traces generated by the SLM itself and selected via rejection sampling. The effect was observed across the Qwen2.5, LLaMA-3, and DeepSeek model families. The analysis indicates that Oracle refinement repairs the logic of a solution but shifts it away from the SLM's native reasoning distribution. This drift raises the SLM's adaptation cost, which offsets the benefit of the improved logic. To address this, the study introduces "Style-Aligned Refinement," a method that preserves the SLM's native trajectory while incorporating the Oracle's logical repairs, thereby lowering adaptation cost and restoring utility. These findings point to the need to optimize both solution quality and learner-data compatibility in mathematical reasoning distillation.
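
The rejection-sampling baseline mentioned above can be sketched as follows. This is a minimal illustration, not the study's pipeline: `generate`, `extract_answer`, and `last_line_answer` are hypothetical helpers, and matching the final answer against a reference is an assumed acceptance rule.

```python
from typing import Callable, List


def rejection_sample(
    problem: str,
    reference_answer: str,
    generate: Callable[[str], str],
    extract_answer: Callable[[str], str],
    n_samples: int = 8,
) -> List[str]:
    """Keep only the SLM's own traces whose final answer matches the reference."""
    kept = []
    for _ in range(n_samples):
        trace = generate(problem)  # hypothetical call into the SLM
        if extract_answer(trace) == reference_answer:
            kept.append(trace)  # accepted traces stay in the SLM's native style
    return kept


def last_line_answer(trace: str) -> str:
    """Assumed format: the trace ends with a line like 'Answer: 4'."""
    lines = trace.strip().splitlines()
    if not lines:
        return ""
    return lines[-1].removeprefix("Answer:").strip()
```

Because every accepted trace was written by the SLM itself, the resulting training set stays inside the learner's own distribution, which is the property the summary credits for its better downstream utility.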

Key takeaway

For Machine Learning Engineers developing Small Language Models for mathematical reasoning, relying solely on high-reward Oracle-refined data for knowledge distillation can be counterproductive. Prioritize data that is compatible with your SLM's native reasoning distribution, even if that means training on SLM-generated data selected via rejection sampling. Consider implementing "Style-Aligned Refinement" to preserve your model's intrinsic style while still benefiting from the Oracle's logical repairs, which the study reports lowers adaptation cost and restores downstream utility.

Key insights

High-quality Oracle data can impair SLM mathematical reasoning due to distributional drift, despite high reward scores.
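
One way to make the drift concrete is to score each candidate trace by how surprising it is to the SLM. The sketch below uses mean per-token negative log-likelihood as a proxy for adaptation cost. This proxy is an assumption made for illustration, the summary does not say how the study measures adaptation cost, and the function names are hypothetical. Token log-probabilities are assumed to have been computed beforehand with the SLM.

```python
from typing import Dict, List, Sequence


def mean_nll(token_logprobs: Sequence[float]) -> float:
    """Mean negative log-likelihood of a trace under the SLM (assumed proxy)."""
    if not token_logprobs:
        raise ValueError("trace has no tokens")
    return -sum(token_logprobs) / len(token_logprobs)


def rank_by_compatibility(traces: Dict[str, Sequence[float]]) -> List[str]:
    """Order trace names from most to least compatible (lowest NLL first)."""
    return sorted(traces, key=lambda name: mean_nll(traces[name]))
```

Under this proxy, a trace with a high reward score but a high mean NLL would be one the SLM finds hard to imitate, which matches the qualitative picture of the paradox.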

Method

Style-Aligned Refinement preserves an SLM's native reasoning trajectory while integrating logical repairs from a stronger Oracle, reducing adaptation cost and improving utility.
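
A minimal sketch of the idea, under stated assumptions: the SLM's trace is split into steps, and a hypothetical `oracle_repair` callback returns a corrected step only where it finds a logical error, leaving every other step in the SLM's own wording. The summary does not specify the study's procedure, so this shows only the general shape of such a method.

```python
from typing import Callable, List, Optional


def style_aligned_refine(
    slm_steps: List[str],
    oracle_repair: Callable[[List[str], int], Optional[str]],
) -> List[str]:
    """Patch only the steps the Oracle flags; keep the rest of the SLM trace verbatim."""
    refined = list(slm_steps)  # copy so the original trace is not mutated
    for i in range(len(refined)):
        # The Oracle sees the trace with earlier repairs already applied.
        fix = oracle_repair(refined, i)
        if fix is not None:
            refined[i] = fix  # minimal, local repair
    return refined
```

The design choice is that the Oracle edits rather than rewrites: step order, phrasing, and length of the unflagged steps are inherited from the SLM, so the repaired trace should stay closer to the learner's distribution than a full Oracle rewrite would.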

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.