The Quality-Utility Paradox: Why High-Reward Data Impairs Small Model Mathematical Reasoning

2026-06-15 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

The study identifies a "Quality-Utility Paradox" in distilling mathematical reasoning into Small Language Models (SLMs). Contrary to the common assumption that data scoring higher under a reward model is better training data, traces refined or synthesized by a stronger Oracle model consistently underperform traces generated by the SLM itself and selected via rejection sampling. The effect was observed across the Qwen2.5, LLaMA-3, and DeepSeek model families. The analysis indicates that Oracle refinement repairs the logic of a solution but also shifts the data away from the SLM's native reasoning patterns. This distributional drift raises the SLM's adaptation cost, which offsets the benefit of the improved logic. To address this, the study introduces "Style-Aligned Refinement," a method that preserves the SLM's native trajectory while incorporating the Oracle's logical repairs, lowering adaptation cost and restoring utility. The findings point to the need to optimize both solution quality and learner-data compatibility in mathematical reasoning distillation.
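The self-generated baseline in the summary relies on rejection sampling: the SLM produces several traces per problem and only those reaching the correct answer are kept. A minimal sketch follows, assuming a hypothetical `sample_fn` that returns one trace from the learner and an assumed trace format that ends with an `Answer:` marker. Neither detail comes from the study.

```python
from typing import Callable, List


def extract_final_answer(trace: str) -> str:
    """Return the text after the last 'Answer:' marker (an assumed trace format)."""
    marker = "Answer:"
    idx = trace.rfind(marker)
    return trace[idx + len(marker):].strip() if idx != -1 else ""


def rejection_sample(
    problem: str,
    reference_answer: str,
    sample_fn: Callable[[str], str],  # hypothetical: draws one trace from the SLM
    num_samples: int = 8,
) -> List[str]:
    """Sample traces from the learner and keep those whose final answer is correct."""
    kept = []
    for _ in range(num_samples):
        trace = sample_fn(problem)
        if extract_final_answer(trace) == reference_answer.strip():
            kept.append(trace)
    return kept
```

Because every kept trace was written by the learner, the resulting training set stays in the learner's own style by construction; the filter only removes wrong answers.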

Key takeaway

For Machine Learning Engineers developing Small Language Models for mathematical reasoning, relying solely on high-reward Oracle-refined data for knowledge distillation can be counterproductive. Prioritize data that is compatible with your SLM's native reasoning distribution, even if that means training on SLM-generated traces selected via rejection sampling. Consider implementing "Style-Aligned Refinement" to keep your model's own style while still benefiting from the Oracle's logical repairs; this improves downstream utility and avoids added adaptation cost.

Key insights

High-quality Oracle data can impair SLM mathematical reasoning due to distributional drift, despite high reward scores.

Principles

Reward model scores alone are insufficient for distillation.
Oracle refinement causes distributional drift.
Learner-data compatibility is crucial for utility.
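One way to make learner-data compatibility concrete is to score each candidate dataset by how surprising it is to the SLM. The sketch below uses mean per-token negative log-likelihood as a proxy for adaptation cost. This proxy is an assumption made for illustration; the summary does not say how the study measures drift or adaptation cost. The log-probabilities are assumed to come from scoring each trace with the untuned SLM, and the numbers shown are made up.

```python
from typing import Sequence


def mean_token_nll(token_logprobs: Sequence[Sequence[float]]) -> float:
    """Average negative log-likelihood per token over a set of traces.

    token_logprobs[i][j] is the learner's log-probability of token j in trace i.
    """
    total = 0.0
    count = 0
    for trace in token_logprobs:
        total -= sum(trace)
        count += len(trace)
    if count == 0:
        raise ValueError("no tokens to score")
    return total / count


# Illustrative values only, not results from the study.
self_generated = [[-0.2, -0.1], [-0.3]]
oracle_refined = [[-1.2, -0.9], [-1.5]]
drift_gap = mean_token_nll(oracle_refined) - mean_token_nll(self_generated)
```

A positive `drift_gap` would suggest the Oracle-refined set sits further from the learner's native distribution than the self-generated set.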

Method

Style-Aligned Refinement preserves an SLM's native reasoning trajectory while integrating logical repairs from a stronger Oracle, reducing adaptation cost and improving utility.
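The summary does not describe the mechanics of Style-Aligned Refinement, so the following is only one hypothetical reading of "preserve the trajectory, integrate the repairs": keep the SLM's steps verbatim and replace only the steps the Oracle flags as wrong. The function name, the step-level granularity, and the `repairs` mapping are all assumptions for illustration, not the authors' implementation.

```python
from typing import Dict, List, Tuple


def apply_step_repairs(
    slm_steps: List[str],
    repairs: Dict[int, str],  # hypothetical: step index -> Oracle-corrected step
) -> Tuple[List[str], float]:
    """Patch only the flagged steps; return the refined trace and the
    fraction of original steps left untouched."""
    if not slm_steps:
        raise ValueError("empty trace")
    for i in repairs:
        if not 0 <= i < len(slm_steps):
            raise IndexError(f"repair index {i} is outside the trace")
    refined = [repairs.get(i, step) for i, step in enumerate(slm_steps)]
    unchanged = sum(1 for old, new in zip(slm_steps, refined) if old == new)
    return refined, unchanged / len(slm_steps)


steps = ["Let x = 3.", "Then 2x = 5.", "So 2x + 1 = 6."]
fixed, preserved = apply_step_repairs(
    steps, {1: "Then 2x = 6.", 2: "So 2x + 1 = 7."}
)
```

The preserved fraction gives a rough handle on how much of the learner's own wording survives a repair, in contrast to a full Oracle rewrite where it would typically be near zero.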

In practice

Prioritize learner-data compatibility in distillation.
Consider SLM-generated data selected via rejection sampling over Oracle-refined data.
Implement style-aligned refinement techniques.
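The practices above amount to choosing training data on two axes at once: solution quality and compatibility with the learner. A rough sketch of such a selection rule follows, assuming each candidate already carries a reward-model score and a learner-side negative log-likelihood. The linear trade-off and the `drift_weight` parameter are illustrative choices, not taken from the study.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    text: str
    reward: float       # reward-model score, higher is better
    learner_nll: float  # mean per-token NLL under the SLM, lower is more compatible


def select_candidate(candidates: List[Candidate], drift_weight: float = 1.0) -> Candidate:
    """Pick the candidate with the best reward after penalizing learner-side NLL."""
    if not candidates:
        raise ValueError("no candidates")
    return max(candidates, key=lambda c: c.reward - drift_weight * c.learner_nll)


# Illustrative values only, not results from the study.
pool = [
    Candidate("oracle-refined trace", reward=0.9, learner_nll=1.2),
    Candidate("self-generated trace", reward=0.7, learner_nll=0.2),
]
chosen = select_candidate(pool)
```

Setting `drift_weight` to zero recovers pure reward-based selection, which is the setup the paradox warns against.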

Topics

Small Language Models
Knowledge Distillation
Mathematical Reasoning
Reward Models
Distributional Drift
Style-Aligned Refinement

Code references

Dracoqhl/Quality-Utility-Paradox

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.