Reasoning Quality Emerges Early: Data Curation for Reasoning Models
Summary
A data curation method for supervised fine-tuning (SFT) of large language models (LLMs) improves reasoning capabilities by identifying high-quality, challenging examples at low cost. Unlike existing approaches that depend on strong reasoning models for filtering, it detects difficult problems from the loss on the first 100 reasoning tokens at a randomly perturbed checkpoint of the pretrained model. It also identifies examples whose loss patterns over their first 1,000 reasoning tokens are similar across multiple perturbed checkpoints; such examples are shown to induce similar gradients. Experiments on Qwen2.5-7B and Llama3.1-8B, using the M23K medical reasoning and OpenThoughts-Math datasets, show the method outperforming existing baselines by up to 1.7% while achieving 91% greater token efficiency.
Key takeaway
Machine Learning Engineers fine-tuning LLMs for complex reasoning tasks should consider integrating early-token loss analysis into their data curation pipelines. The method identifies high-quality, challenging examples using only the first 100 to 1,000 reasoning tokens, which reduces the computational cost of curation. The reported gains are up to 1.7% over existing baselines, with 91% greater token efficiency.
Key insights
Reasoning quality in LLMs can be effectively improved by curating SFT data based on early token loss patterns, significantly boosting efficiency.
Principles
- Early reasoning token loss indicates problem difficulty.
- Similar loss patterns imply similar gradient induction.
- Perturbed checkpoints reveal data quality.
Method
Difficult reasoning problems are detected by evaluating the loss on the first 100 reasoning tokens at a randomly perturbed checkpoint of the pretrained model. Examples whose loss patterns over the first 1,000 reasoning tokens are similar across multiple perturbed checkpoints are identified as inducing similar gradients.
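The difficulty signal can be illustrated with a minimal sketch. It assumes the per-token log-probabilities of each example's reasoning trace have already been computed at a perturbed checkpoint, and that a higher mean loss over the first 100 tokens marks a harder problem. The function names, the `budget` parameter, and the top-k selection rule are hypothetical and are not taken from the original article.

```python
from typing import Dict, List, Sequence


def early_token_loss(token_logprobs: Sequence[float], k: int = 100) -> float:
    """Mean negative log-likelihood over the first k reasoning tokens."""
    window = token_logprobs[:k]
    if not window:
        raise ValueError("example has no reasoning tokens")
    return -sum(window) / len(window)


def select_hardest(
    examples: Dict[str, Sequence[float]], budget: int, k: int = 100
) -> List[str]:
    """Return the ids of the `budget` examples with the highest early-token loss.

    Assumption: higher early loss means a more difficult example.
    """
    scored = {ex_id: early_token_loss(lp, k) for ex_id, lp in examples.items()}
    return sorted(scored, key=scored.get, reverse=True)[:budget]


# Toy log-probabilities (illustrative values, not real model output).
examples = {
    "a": [-0.1, -0.2],  # mean loss 0.15
    "b": [-2.0, -1.0],  # mean loss 1.5
    "c": [-0.5, -0.5],  # mean loss 0.5
}
print(select_hardest(examples, budget=2))  # → ['b', 'c']
```

Because only the first `k` tokens are scored, the cost per example stays bounded regardless of how long the full reasoning trace is.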
In practice
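- Use the loss on the first 100 reasoning tokens to detect difficult examples.
- Compare loss patterns over the first 1,000 reasoning tokens across perturbed checkpoints to find examples that induce similar gradients.
- Apply when fine-tuning Qwen2.5-7B and Llama3.1-8B, the models used in the reported experiments.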
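One way to picture the loss-pattern comparison is the sketch below. It assumes each example is summarised by a "signature": one mean loss per perturbed checkpoint, computed over the first 1,000 reasoning tokens. Examples whose signatures lie close together are grouped. The Euclidean distance, the `tol` threshold, and the greedy grouping are illustrative assumptions; the original article does not specify how similarity is measured or how the groups are used.

```python
import math
from typing import Dict, List, Sequence, Tuple


def loss_signature(
    per_checkpoint_logprobs: Sequence[Sequence[float]], k: int = 1000
) -> List[float]:
    """One mean negative log-likelihood per checkpoint over the first k tokens."""
    signature = []
    for logprobs in per_checkpoint_logprobs:
        window = logprobs[:k]
        if not window:
            raise ValueError("example has no reasoning tokens")
        signature.append(-sum(window) / len(window))
    return signature


def group_by_signature(
    signatures: Dict[str, Sequence[float]], tol: float
) -> List[List[str]]:
    """Greedy grouping: an example joins the first group whose representative
    signature is within Euclidean distance `tol`; otherwise it starts a group."""
    groups: List[Tuple[List[float], List[str]]] = []
    for ex_id, sig in signatures.items():
        for representative, members in groups:
            if math.dist(representative, sig) <= tol:
                members.append(ex_id)
                break
        else:
            groups.append((list(sig), [ex_id]))
    return [members for _, members in groups]


# Toy signatures over two perturbed checkpoints (illustrative values).
signatures = {"a": [1.0, 1.2], "b": [1.05, 1.18], "c": [3.0, 2.5]}
print(group_by_signature(signatures, tol=0.1))  # → [['a', 'b'], ['c']]
```

Under the assumption that members of a group induce similar gradients, a curation pipeline could keep only a few representatives per group, although that selection step is a design choice not described in this summary.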
Topics
- Supervised Fine-tuning
- Large Language Models
- Data Curation
- Reasoning Tasks
- Token Efficiency
- Qwen2.5-7B
- Llama3.1-8B
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.