[P] I trained Qwen2.5-1.5b with RLVR (GRPO) vs SFT and compared benchmark performance
Summary
A study compared Reinforcement Learning with Verifiable Rewards (RLVR) and Supervised Fine-tuning (SFT) on the Qwen2.5-1.5B-Instruct model using the GSM8K dataset. RLVR, a method similar to that used in DeepSeek-R1, improved math reasoning scores by 11.9 points. In contrast, SFT, a standard next-token prediction approach, lowered scores by 15.2 points on the same benchmark. Experiments included standard training, a cheating analysis that trained on the GSM8K test set, and one-example RLVR training. The results indicate that RLVR enhances general reasoning, even with minimal data, while SFT appears to override pretrained knowledge: it reduces the no-answer rate, but the answers it produces are less accurate. The project involved benchmarking 388 checkpoints and logging over 2.4 million rows of data.
Key takeaway
For AI Scientists and Research Scientists focused on improving reasoning capabilities in large language models, this research suggests prioritizing RLVR over traditional SFT. When planning a fine-tuning run, check whether SFT is merely teaching format compliance rather than actual reasoning, since it can degrade core abilities. Explore RLVR (GRPO) for tasks requiring robust mathematical or logical inference, even with limited training examples, where this study saw substantial performance gains.
Key insights
RLVR significantly boosts math reasoning in LLMs, while SFT can degrade it by overriding pretrained knowledge.
Principles
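- RLVR improved general reasoning ability in this study, even with a single training example.
- SFT can degrade pretrained knowledge.
- Rewards that can be checked automatically, such as a correct final answer, improved math reasoning here.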
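To make "verifiable reward" concrete, here is a minimal sketch of a rule-based reward for GSM8K-style problems. GSM8K reference solutions end with `#### <number>`. The convention of taking the last number in the model's completion as its answer is an assumption made for illustration, as are the function names; the original post's reward code is not shown here and may differ.

```python
import re
from typing import Optional

# Matches integers and decimals, allowing thousands separators (e.g. "1,234").
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_reference(answer_field: str) -> str:
    """Return the final answer from a GSM8K reference solution ('... #### 18')."""
    return answer_field.split("####")[-1].strip().replace(",", "")


def extract_prediction(completion: str) -> Optional[str]:
    """Assumed convention: the last number in the completion is the answer."""
    matches = _NUMBER.findall(completion)
    if not matches:
        return None  # the model produced no numeric answer at all
    return matches[-1].replace(",", "")


def verifiable_reward(completion: str, answer_field: str) -> float:
    """Return 1.0 if the predicted number equals the reference, else 0.0."""
    prediction = extract_prediction(completion)
    if prediction is None:
        return 0.0
    try:
        return 1.0 if float(prediction) == float(extract_reference(answer_field)) else 0.0
    except ValueError:
        # The reference field did not contain a parseable number.
        return 0.0
```

Because the check is a deterministic comparison rather than a learned model, the reward cannot be talked into accepting a wrong answer, although a parser this simple can still misread unusual output formats.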
Method
The study compared RLVR (GRPO) and SFT on Qwen2.5-1.5B-Instruct using the GSM8K dataset across three scenarios: standard training, training on the GSM8K test set (the cheating analysis), and one-example RLVR training.
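For readers unfamiliar with GRPO, its core step is to sample several completions for the same prompt, score each one, and normalize every reward against the group's mean and standard deviation. The sketch below shows only that normalization. It is not the author's training code; the function name, the `eps` constant, and the use of the population standard deviation are illustrative assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the other completions for the same prompt.

    `rewards` holds one score per sampled completion, e.g. 1.0 for a correct
    final answer and 0.0 otherwise.
    """
    group_mean = mean(rewards)
    group_std = pstdev(rewards)  # assumption: population std; some libraries use sample std
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - group_mean) / (group_std + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled completions for one prompt: two correct, two incorrect.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # All completions correct: every advantage is zero.
    print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

One consequence of this design is that a prompt contributes no learning signal when all of its sampled completions are correct or all are incorrect, because every advantage is then zero.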
In practice
- Consider RLVR for reasoning-intensive tasks.
- Evaluate SFT's impact on pretrained knowledge.
- Use automatically verifiable signals, such as exact-match answer checks, as the reward.
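When evaluating a fine-tuned checkpoint against its base model, it helps to track the no-answer rate separately from accuracy, since the summary notes that SFT lowered the former while also lowering the latter. The sketch below is one way to do that. The `EvalSummary` type, the `summarize` function, and the toy data are hypothetical and are not taken from the study's results.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EvalSummary:
    accuracy: float           # fraction of problems answered correctly
    no_answer_rate: float     # fraction with no extractable answer
    wrong_answer_rate: float  # fraction with an answer that is incorrect


def summarize(predictions: List[Optional[str]], references: List[str]) -> EvalSummary:
    """Split outcomes into correct, missing, and wrong; None means no answer was found."""
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and equal in length")
    total = len(references)
    missing = sum(1 for p in predictions if p is None)
    correct = sum(1 for p, r in zip(predictions, references) if p is not None and p == r)
    wrong = total - missing - correct
    return EvalSummary(correct / total, missing / total, wrong / total)


if __name__ == "__main__":
    refs = ["18", "3", "7", "5"]            # toy reference answers
    base = ["18", None, None, "5"]          # toy base-model outputs
    finetuned = ["18", "4", "9", "6"]       # toy fine-tuned outputs
    print("base:      ", summarize(base, refs))
    print("fine-tuned:", summarize(finetuned, refs))
```

In this toy data the fine-tuned outputs always contain an answer, so the no-answer rate falls from 0.5 to 0.0, yet accuracy falls from 0.5 to 0.25. A single accuracy number would hide why the score moved.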
Topics
- RLVR
- Supervised Fine-tuning
- Qwen2.5-1.5B-Instruct
- GSM8K Benchmark
- Math Reasoning
Best for: AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, AI Researcher
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.