The Signal is in the Steps: Local Scoring for Reasoning Data Selection
Summary
A new study introduces "Local Naturalness," a method for selecting high-quality reasoning traces from stronger teacher models to fine-tune smaller student Large Language Models (LLMs). The current standard scores a response's "naturalness" by its global log-probability, but that score does not correlate with downstream performance in multi-teacher settings or on long reasoning traces (10K+ tokens). Local Naturalness instead scores a response by the student model's log-probabilities over short, sequential reasoning steps (e.g., sentences), each conditioned on a small local window. This approach enables reliable teacher selection and improves a 32-billion-parameter student's accuracy on math benchmarks by 9.4% over global-naturalness-based selection, even surpassing training on data from the single best teacher. The method is also effective in scientific and coding domains, which suggests it generalizes beyond math.
Key takeaway
NLP engineers and research scientists who distill reasoning capabilities into smaller LLMs should consider Local Naturalness for data selection. Selecting teacher responses by global log-probability, especially in multi-teacher or long-context scenarios, likely leads to suboptimal student model performance. Local log-probability scoring identifies the most beneficial teacher models more reliably and supports curating high-quality, mixed-teacher datasets, which improves downstream reasoning accuracy across domains such as math, science, and code.
Key insights
Local Naturalness improves LLM reasoning distillation by evaluating short, sequential steps, outperforming global log-probability in multi-teacher settings.
Principles
- Global log-probability is unreliable for long reasoning traces.
- Reasoning quality is better assessed at local, step-by-step levels.
- Smaller context windows yield more reliable "naturalness" assessments.
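The contrast behind these principles can be written out as a rough sketch. The notation is an assumption, not taken from the paper: $x$ is the prompt, $y_1, \dots, y_T$ are the response tokens, $s_1, \dots, s_N$ are the response's reasoning steps (e.g., sentences), $p_\theta$ is the student model, and $k$ is the local window size. The length normalization of the global score, and the omission of $x$ from the local context, are also assumptions.

```latex
\[
\mathrm{Global}(y) = \frac{1}{T} \sum_{t=1}^{T} \log p_\theta\!\left(y_t \mid x,\, y_{<t}\right),
\qquad
\mathrm{Local}(y) = \frac{1}{N} \sum_{i=1}^{N} \log p_\theta\!\left(s_i \mid s_{\max(1,\, i-k)}, \dots, s_{i-1}\right)
\]
```

In the global score every token is conditioned on the full preceding trace, which can run to 10K+ tokens. In the local score each step sees only its last few predecessors.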
Method
Local Naturalness scores a response by averaging the log-probabilities of its constituent logical steps, each conditioned on a limited context of at most four preceding sentences rather than on the entire preceding sequence.
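A minimal sketch of this scoring loop follows, under stated assumptions. The names `split_into_steps`, `local_naturalness`, and `score_step` are hypothetical, not from the paper. `score_step(context, step)` stands in for whatever returns the student model's log-probability of `step` given `context`. The regex sentence splitter is a naive placeholder for the paper's step segmentation, and whether the step score is token-normalized is left to the caller.

```python
import re
from typing import Callable, List

# Hypothetical scorer type: (context, step) -> student log-probability of step.
StepScorer = Callable[[str, str], float]


def split_into_steps(response: str) -> List[str]:
    """Naive splitter (assumption: one reasoning step = one sentence)."""
    parts = re.split(r"(?<=[.!?])\s+", response.strip())
    return [p for p in parts if p]


def local_naturalness(response: str, score_step: StepScorer, window: int = 4) -> float:
    """Average per-step log-probability, each step conditioned only on
    the `window` steps that precede it (at most four by default)."""
    steps = split_into_steps(response)
    if not steps:
        raise ValueError("response contains no steps to score")
    total = 0.0
    for i, step in enumerate(steps):
        # Local context: the last `window` steps, not the whole trace.
        context = " ".join(steps[max(0, i - window):i])
        total += score_step(context, step)
    return total / len(steps)
```

In practice `score_step` would wrap a forward pass of the student model. Keeping it as an injected callable lets the windowing logic be checked with a toy scorer first.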
In practice
- Use Local Naturalness for optimal teacher model selection.
- Apply local scoring to curate mixed-teacher datasets.
- Prioritize local over global metrics for long-context reasoning tasks.
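The first two items above can be sketched as follows. This is one possible reading, not the paper's procedure: teachers are ranked by their mean score, and the mixed dataset keeps the highest-scoring response per prompt. The function names and the `candidates` layout (prompt mapped to a dict of teacher name to response) are hypothetical, and `score` is any response-level scorer, such as a Local Naturalness function.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical layout: prompt -> {teacher_name: response}
Candidates = Dict[str, Dict[str, str]]


def rank_teachers(candidates: Candidates, score: Callable[[str], float]) -> List[Tuple[str, float]]:
    """Rank teachers by mean response score, best first."""
    scores = defaultdict(list)
    for by_teacher in candidates.values():
        for teacher, response in by_teacher.items():
            scores[teacher].append(score(response))
    means = {t: sum(v) / len(v) for t, v in scores.items()}
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)


def select_mixed_teacher_data(candidates: Candidates, score: Callable[[str], float]) -> List[Tuple[str, str, str]]:
    """For each prompt, keep the response with the highest score.
    Returns (prompt, teacher, response) triples."""
    selected = []
    for prompt, by_teacher in candidates.items():
        if not by_teacher:
            continue  # no teacher answered this prompt
        teacher, response = max(by_teacher.items(), key=lambda kv: score(kv[1]))
        selected.append((prompt, teacher, response))
    return selected
```

Scoring each response once and caching the result would avoid repeated student forward passes when both functions run over the same candidates.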
Topics
- Reasoning Data Selection
- Local Naturalness
- Global Log Probability
- Supervised Fine-Tuning
- Multi-Teacher Distillation
Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.