The Signal is in the Steps: Local Scoring for Reasoning Data Selection

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

A new study introduces "Local Naturalness," a method for selecting high-quality reasoning traces from stronger teacher models to fine-tune smaller student Large Language Models (LLMs). The current standard assesses a response's "naturalness" by its global log-probability, computed over the whole response. According to the study, that score stops correlating with downstream performance in multi-teacher settings and on long reasoning traces (10K+ tokens). Local Naturalness instead measures the student model's log-probabilities over short, sequential reasoning steps (e.g., sentences), each conditioned on only a small local window. This enables reliable teacher selection and raises a 32-billion-parameter student's accuracy on math benchmarks by 9.4% over global-naturalness-based selection, even surpassing training on data from the single best teacher. The method is also effective in scientific and coding domains, which suggests it generalizes beyond math.

Key takeaway

For NLP engineers and research scientists distilling reasoning capabilities into smaller LLMs, Local Naturalness is worth adopting for data selection. Selecting teacher responses by global log-probability, especially in multi-teacher or long-context scenarios, likely leads to suboptimal student performance. Scoring with local log-probabilities instead identifies the most beneficial teacher models more accurately and supports curating high-quality, mixed-teacher datasets, which improves downstream reasoning accuracy and generalization across domains such as math, science, and code.
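To make the curation step concrete, here is a minimal Python sketch of building a mixed-teacher dataset from scored responses. The summary does not state the exact selection rule, so keeping the single highest-scoring response per prompt is an assumption, as are the names `select_mixed_teacher_dataset`, `candidates`, and `score_fn`. Any scoring function can be plugged in, for example a local log-probability score computed with the student model.

```python
from typing import Callable, Dict, List, Tuple


def select_mixed_teacher_dataset(
    candidates: Dict[str, Dict[str, str]],
    score_fn: Callable[[str], float],
) -> List[Tuple[str, str, str]]:
    """Pick one teacher response per prompt.

    `candidates` maps prompt -> {teacher_name: response}.
    Assumption (not confirmed by the source): keep the response with the
    highest score for each prompt, where higher means more "natural".
    """
    selected = []
    for prompt, by_teacher in candidates.items():
        if not by_teacher:
            continue  # no teacher answered this prompt
        best_teacher = max(by_teacher, key=lambda t: score_fn(by_teacher[t]))
        selected.append((prompt, best_teacher, by_teacher[best_teacher]))
    return selected


# Toy scorer for illustration only: shorter responses score higher.
toy_score = lambda response: -float(len(response))

toy_candidates = {
    "p1": {"teacher_a": "a long answer", "teacher_b": "short"},
    "p2": {"teacher_a": "x", "teacher_b": "yyyy"},
}
dataset = select_mixed_teacher_dataset(toy_candidates, toy_score)
```

In this toy run the selected set mixes both teachers, one response per prompt. A real pipeline would replace `toy_score` with a score derived from the student model.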

Key insights

Local Naturalness improves LLM reasoning distillation by evaluating short, sequential steps, outperforming global log-probability in multi-teacher settings.

Principles

Method

Local Naturalness scores a response by averaging the log-probabilities of its constituent logical steps, each conditioned on a limited context of at most four preceding sentences rather than on the entire preceding sequence.
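The scoring rule described above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the `LogProbFn` interface, the regex sentence splitter, and the per-token averaging within each step are all hypothetical choices, and `toy_logprob_fn` is a stand-in for a real student model.

```python
import re
from typing import Callable, List

# Hypothetical interface (an assumption, not from the study): given a context
# string and a continuation string, return the student model's per-token
# log-probabilities for the continuation.
LogProbFn = Callable[[str, str], List[float]]


def split_steps(response: str) -> List[str]:
    """Naive sentence splitter standing in for the real step segmentation."""
    parts = re.split(r"(?<=[.!?])\s+", response.strip())
    return [p for p in parts if p]


def local_naturalness(response: str, logprob_fn: LogProbFn, window: int = 4) -> float:
    """Average per-step score, each step conditioned on at most `window` prior steps."""
    steps = split_steps(response)
    step_scores = []
    for i, step in enumerate(steps):
        # Local context: only the preceding `window` sentences, not the full prefix.
        context = " ".join(steps[max(0, i - window):i])
        token_lps = logprob_fn(context, step)
        if token_lps:
            # Assumption: normalize within a step by averaging token log-probs.
            step_scores.append(sum(token_lps) / len(token_lps))
    if not step_scores:
        raise ValueError("response contains no scorable steps")
    return sum(step_scores) / len(step_scores)


def toy_logprob_fn(context: str, continuation: str) -> List[float]:
    """Toy stand-in for a student model: words already in the context score higher."""
    seen = set(context.split())
    return [-1.0 if tok in seen else -2.0 for tok in continuation.split()]


score = local_naturalness("A b. A c. D.", toy_logprob_fn)
```

In a real setup, `logprob_fn` would wrap the student LLM and return its token log-probabilities for each step given the local window. Because each call sees only a few sentences, the cost per step stays bounded even for traces of 10K+ tokens.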

Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.