[P] I trained Qwen2.5-1.5b with RLVR (GRPO) vs SFT and compared benchmark performance

2026-03-03 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Advanced, quick

Summary

This project compared Reinforcement Learning with Verifiable Rewards (RLVR, implemented with GRPO) and Supervised Fine-tuning (SFT) on the Qwen2.5-1.5B-Instruct model using the GSM8K dataset. RLVR, a method similar to that used in DeepSeek-R1, raised math reasoning scores by 11.9 points. SFT, a standard next-token prediction approach, lowered scores by 15.2 points on the same benchmark. Experiments covered standard training, a cheating analysis that trains on the GSM8K test set, and one-example RLVR training. The results indicate that RLVR improves general reasoning even with minimal data, while SFT appears to override pretrained knowledge: the model answers more often (a lower no-answer rate) but its answers are less accurate. The project benchmarked 388 checkpoints and logged over 2.4 million rows of data.
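The distinction between the no-answer rate and accuracy can be made concrete with a small scoring helper. The sketch below is an illustration only: the function name `summarize_answers` and the toy inputs are hypothetical and do not come from the project's code or data. It assumes each model output has already been parsed into a number, or `None` when no final answer could be extracted.

```python
from typing import Optional, Sequence


def summarize_answers(
    predictions: Sequence[Optional[float]],
    golds: Sequence[float],
    tol: float = 1e-6,
) -> dict:
    """Split results into correct, wrong, and no-answer fractions.

    `predictions[i]` is the parsed final answer for problem i, or None
    if the model produced nothing that could be parsed as an answer.
    """
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    total = len(golds)
    no_answer = sum(p is None for p in predictions)
    correct = sum(
        p is not None and abs(p - g) < tol for p, g in zip(predictions, golds)
    )
    wrong = total - correct - no_answer
    return {
        "accuracy": correct / total,
        "no_answer_rate": no_answer / total,
        "wrong_answer_rate": wrong / total,
    }


# Hypothetical toy inputs: four problems, one unanswered, one answered wrongly.
stats = summarize_answers([72.0, None, 5.0, 10.0], [72.0, 3.0, 6.0, 10.0])
```

Tracking the three fractions separately shows whether a drop in the no-answer rate comes with more correct answers or only with more wrong ones.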

Key takeaway

For AI Scientists and Research Scientists working on reasoning in large language models, these results favor RLVR over traditional SFT, at least for a 1.5B model on GSM8K. When you fine-tune with SFT, check whether it is teaching format compliance rather than actual reasoning, since it can degrade core abilities. For tasks that need mathematical or logical inference, RLVR (GRPO) is worth trying even with very few training examples.

Key insights

In this comparison, RLVR raised math reasoning scores by 11.9 points, while SFT lowered them by 15.2 points, apparently by overriding pretrained knowledge.

Principles

RLVR improves general reasoning ability.
SFT can degrade pretrained knowledge.
Rewards that can be checked automatically provide a training signal that improved benchmark performance.
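GRPO turns raw rewards into a learning signal by comparing each sampled completion with the other completions for the same prompt. The sketch below shows that group-relative normalization in isolation. It follows the commonly described form (subtract the group mean, divide by the group standard deviation); the function name, the `eps` value, and the choice of population standard deviation are assumptions, and the project's own training code may differ.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """Normalize rewards within one group of completions for a single prompt.

    Completions that beat the group average get a positive advantage,
    those below it get a negative one. `eps` avoids division by zero
    when every completion receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one prompt, two correct and two incorrect.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every completion in a group gets the same reward, all advantages are zero and that prompt contributes no gradient signal.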

Method

The study compared RLVR (GRPO) and SFT on Qwen2.5-1.5B-Instruct using the GSM8K dataset, including standard, test-set, and one-example training scenarios.
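A verifiable reward for GSM8K can be as simple as checking the model's final number against the reference answer. The sketch below is a minimal illustration and not the project's actual reward function: the names `extract_final_number` and `gsm8k_reward` and the "last number in the text" heuristic are assumptions. It relies on the GSM8K convention that each reference solution ends with `#### <answer>`.

```python
import re
from typing import Optional

# Signed integers or decimals, allowing thousands separators such as 1,234.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text: str) -> Optional[float]:
    """Return the last number found in `text`, or None if there is none."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def gsm8k_reward(completion: str, reference: str) -> float:
    """Binary reward: 1.0 if the completion's final number equals the gold answer."""
    predicted = extract_final_number(completion)
    # The gold answer is whatever follows the "####" marker in the reference.
    gold = extract_final_number(reference.split("####")[-1])
    if predicted is None or gold is None:
        return 0.0
    return 1.0 if abs(predicted - gold) < 1e-6 else 0.0


# Hypothetical completion and reference, written only for this example.
reward = gsm8k_reward(
    "She sold 48 in April and 24 in May, so the answer is 72.",
    "48 / 2 = 24 clips in May. 48 + 24 = 72 clips in total. #### 72",
)
```

A reward like this needs no learned reward model, only a rule that can be checked automatically. A stricter variant could require a specific answer format before awarding credit.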

In practice

Consider RLVR for reasoning-intensive tasks.
Evaluate SFT's impact on pretrained knowledge.
Use automatically verifiable signals as the reward.

Topics

RLVR
Supervised Fine-tuning
Qwen2.5-1.5B-Instruct
GSM8K Benchmark
Math Reasoning

Code references

jayminban/RLVR-vs-SFT-Qwen2.5-1.5b

Best for: AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, AI Researcher

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.