DeReason: A Difficulty-Aware Curriculum Improves Decoupled SFT-then-RL Training for General Reasoning

Source: cs.CL updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering) · Depth: Advanced, extended

Summary

DeReason is a difficulty-aware curriculum for post-training large language models (LLMs) on general scientific (STEM) reasoning. It coordinates supervised fine-tuning (SFT) with reinforcement learning with verifiable rewards (RLVR). The motivation is that pure RLVR is sample-inefficient in general STEM domains and is often outperformed by SFT on moderate-quality responses. DeReason partitions the training data into non-reasoning-intensive and reasoning-intensive subsets using LLM-estimated difficulty scores on a 1-5 scale. Broad-coverage, non-reasoning-intensive problems (difficulty $\leq\tau$) go to SFT to build foundational domain knowledge; a focused subset of difficult problems (difficulty $>\tau$) is reserved for RL to cultivate complex reasoning. Tested on Qwen3-4B-Base, this decoupling significantly outperforms SFT-only, RL-only, and random-split baselines on general STEM and mathematical benchmarks, including MMLU-Pro, GPQA-Diamond, SuperGPQA, and BBEH.
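
The difficulty split can be written compactly in set notation. The symbols $\mathcal{D}$ (the full training set) and $d(x)$ (the LLM-estimated difficulty of problem $x$) are notation introduced here for illustration and are not taken from the source; only $\tau$ appears in the summary.

$$
\mathcal{D}_{\mathrm{SFT}} = \{\, x \in \mathcal{D} : d(x) \leq \tau \,\}, \qquad
\mathcal{D}_{\mathrm{RL}} = \{\, x \in \mathcal{D} : d(x) > \tau \,\}, \qquad
d(x) \in \{1, \dots, 5\}
$$

By construction the two subsets are disjoint and together cover $\mathcal{D}$, so every problem is used in exactly one training stage.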

Key takeaway

For research scientists developing general-reasoning LLMs, DeReason offers a post-training recipe: use an LLM to score problem difficulty, allocate the easier, broad-coverage data to SFT to build foundational knowledge, then run RL on a focused set of harder, reasoning-intensive problems. The paper reports substantial gains over SFT-only and RL-only training, particularly on benchmarks that require deep reasoning.

Key insights

Difficulty-aware data partitioning for SFT-then-RL training significantly enhances LLM general reasoning.

Principles

Method

DeReason partitions the training data by LLM-estimated difficulty, assigns easier problems to SFT for foundational knowledge and harder problems to RL for complex reasoning, and then trains in sequence: SFT first, RL second.
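
The partitioning step might look like the following minimal sketch. It assumes difficulty scores have already been produced by an LLM grader and stored with each problem. The `Problem` class, the `partition_by_difficulty` function, the toy questions, and the threshold value `TAU = 3` are all hypothetical choices made for illustration; the summary does not state the value of $\tau$ or describe the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    question: str
    difficulty: int  # LLM-estimated score on the 1-5 scale (assumed precomputed)


def partition_by_difficulty(problems, tau):
    """Return (sft_set, rl_set): difficulty <= tau goes to SFT, > tau to RL."""
    sft_set, rl_set = [], []
    for p in problems:
        if not 1 <= p.difficulty <= 5:
            raise ValueError(f"difficulty out of range 1-5: {p.difficulty}")
        (sft_set if p.difficulty <= tau else rl_set).append(p)
    return sft_set, rl_set


# Toy data; the questions and scores are invented for illustration only.
problems = [
    Problem("State Newton's second law.", 1),
    Problem("Balance the equation H2 + O2 -> H2O.", 2),
    Problem("Derive the period of a physical pendulum.", 4),
    Problem("Prove that sqrt(2) is irrational.", 5),
]

TAU = 3  # hypothetical threshold; the summary does not give its value
sft_set, rl_set = partition_by_difficulty(problems, TAU)
print(len(sft_set), "problems for SFT;", len(rl_set), "problems for RL")
```

In a full pipeline, `sft_set` would feed the supervised stage and `rl_set` the subsequent RL stage. Because the split is a pure function of the stored scores, it can be recomputed with a different threshold without re-grading the data.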

Best for: Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.