DeReason: A Difficulty-Aware Curriculum Improves Decoupled SFT-then-RL Training for General Reasoning

2025-01-24 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, extended

Summary

DeReason is a difficulty-aware curriculum training strategy for improving general scientific (STEM) reasoning in large language models (LLMs) by coordinating supervised fine-tuning (SFT) with reinforcement learning with verifiable rewards (RLVR). It targets a specific problem: pure RLVR is sample-inefficient in general STEM domains and is often surpassed by SFT, even when the SFT responses are only of moderate quality. DeReason partitions the training data into non-reasoning-intensive and reasoning-intensive subsets using LLM-estimated difficulty scores on a 1 to 5 scale. Broad-coverage, non-reasoning-intensive problems (difficulty $\leq\tau$) go to SFT to build foundational domain knowledge, while a focused subset of difficult problems (difficulty $>\tau$) is reserved for RL to cultivate complex reasoning. Tested on Qwen3-4B-Base, this decoupling outperforms SFT-only, RL-only, and random-split baselines on general STEM and mathematical benchmarks such as MMLU-Pro, GPQA-Diamond, SuperGPQA, and BBEH.

Key takeaway

For research scientists developing general reasoning LLMs, DeReason offers a concrete post-training recipe. Use an LLM to score problem difficulty, allocate the easier, broad-coverage data to SFT to build foundational knowledge, then run RL on a focused set of harder, reasoning-intensive problems. In the reported experiments this split outperforms SFT-only and RL-only training, particularly on benchmarks that require deep reasoning.

Key insights

Difficulty-aware data partitioning for SFT-then-RL training significantly enhances LLM general reasoning.

Principles

SFT excels at efficient knowledge acquisition.
RL pushes performance beyond SFT on challenging problems.
Difficulty tracks what a problem demands: easier problems lean on knowledge recall, harder ones on complex reasoning.

Method

DeReason partitions the training data by LLM-estimated difficulty, assigning easier problems (difficulty $\leq\tau$) to SFT for foundational knowledge and harder problems (difficulty $>\tau$) to RL for complex reasoning. The two stages then run in sequence: SFT first, RL second.
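
The partition rule can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the `Problem` class, the `split_by_difficulty` function, and the example threshold `tau=3` are assumptions introduced here. The summary gives only the 1 to 5 scale and the rule "difficulty $\leq\tau$ goes to SFT, difficulty $>\tau$ goes to RL".

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Problem:
    """Hypothetical container for one training problem."""
    question: str
    difficulty: int  # LLM-estimated score on the 1-5 scale


def split_by_difficulty(
    problems: Iterable[Problem], tau: int
) -> Tuple[List[Problem], List[Problem]]:
    """Return (sft_set, rl_set): difficulty <= tau goes to SFT, > tau to RL."""
    sft_set: List[Problem] = []
    rl_set: List[Problem] = []
    for p in problems:
        if not 1 <= p.difficulty <= 5:
            raise ValueError(f"difficulty out of range: {p.difficulty}")
        if p.difficulty <= tau:
            sft_set.append(p)
        else:
            rl_set.append(p)
    return sft_set, rl_set


# Toy data; tau=3 is a placeholder, not a value taken from the paper.
pool = [
    Problem("Define osmosis.", 1),
    Problem("State Ohm's law.", 2),
    Problem("Balance a redox reaction.", 3),
    Problem("Derive a rate law from a mechanism.", 4),
    Problem("Prove a bound on an integral.", 5),
]
sft_set, rl_set = split_by_difficulty(pool, tau=3)
```

Every problem lands in exactly one subset, so the two stages never train on the same item; that disjointness is what "decoupled" means in this sketch.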

In practice

Use LLM-based scoring for problem difficulty.
Allocate easy data to SFT, hard data to RL.
Apply SFT before RL in a sequential pipeline.
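
The scoring step could look roughly like the sketch below. Everything named here is hypothetical: the prompt wording, `parse_difficulty`, `score_problem`, and the `ask_llm` callable (a stand-in for whatever model client is in use) are assumptions for illustration, and the summary does not describe the actual prompt or scoring model.

```python
import re
from typing import Callable, Optional

# Hypothetical prompt; the wording actually used is not given in the summary.
PROMPT_TEMPLATE = (
    "Rate the difficulty of the following problem on a scale from 1 "
    "(simple recall) to 5 (multi-step reasoning). "
    "Reply with a single integer.\n\nProblem: {question}"
)


def parse_difficulty(reply: str) -> Optional[int]:
    """Extract the first standalone digit in 1..5, or None if there is none."""
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None


def score_problem(
    question: str, ask_llm: Callable[[str], str]
) -> Optional[int]:
    """Ask the model for a score; ask_llm is any prompt -> reply function."""
    return parse_difficulty(ask_llm(PROMPT_TEMPLATE.format(question=question)))


# Stub model for demonstration only; swap in a real client call.
def fake_llm(prompt: str) -> str:
    return "Difficulty: 4"


score = score_problem("Derive a rate law from a mechanism.", fake_llm)
```

Returning `None` for an unparseable reply, rather than guessing a default score, leaves it to the caller to retry or drop the problem before the easy/hard split.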

Topics

RLVR
Supervised Fine-Tuning
Difficulty-Aware Curriculum
General Reasoning
Data Decoupling

Code references

Jiayi-Pan/TinyZero

Best for: Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.