The Death of RLHF: A Practitioner’s Guide to the New Post-Training Stack
Summary
The traditional post-training pipeline for large language models (LLMs), Supervised Fine-Tuning (SFT) followed by Reinforcement Learning from Human Feedback (RLHF), is now largely obsolete. Modern reasoning models like DeepSeek-R1, Nemotron 3 Super, and Qwen3 use a modular stack that addresses three RLHF limitations: annotation bottlenecks, the high computational cost of a four-model setup (policy, reference, reward model, and critic), and reward model drift. Group Relative Policy Optimization (GRPO) eliminates the critic by computing each response's advantage from the reward statistics of a group of responses sampled for the same prompt, which cuts memory enough to make RL feasible on a single GPU. Reinforcement Learning with Verifiable Rewards (RLVR) replaces the learned reward model and human judgment with programmatic verifiers for tasks like math and code, offering scalable and consistent rewards. Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) further stabilizes GRPO for long chain-of-thought outputs by fixing length bias, entropy collapse, and vanishing gradients. The resulting stack separates instruction following (SFT), alignment (DPO/SimPO), and reasoning (GRPO+RLVR).
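The group-statistics idea behind GRPO can be illustrated with a few lines of Python. This is a minimal sketch, not a reference implementation: the function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` stabilizer are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each sampled response relative to its own group.

    `rewards` holds one scalar reward per response sampled for the SAME prompt.
    No learned critic is involved: the group mean acts as the baseline.
    `eps` (an assumed stabilizer) avoids division by zero for uniform groups.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four responses to one prompt, scored 1.0 (correct) or 0.0 (incorrect).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every response gets the same reward carries no signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Correct responses receive positive advantages and incorrect ones negative, while a group with identical rewards yields all-zero advantages and therefore no gradient.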
Key takeaway
For AI engineers fine-tuning LLMs for reasoning tasks, the shift from RLHF to GRPO+RLVR with DAPO fixes means advanced reasoning capabilities are now achievable on more modest hardware. Prioritize robust programmatic verifiers for your domain-specific tasks, since they are now the primary driver of model improvement. Watch for reward hacking and for training instabilities on long chain-of-thought outputs, and configure training with a token-level loss and decoupled lower and upper clipping bounds.
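As an illustration of what a programmatic verifier can look like, here is a small sketch for math answers. It assumes, purely for the example, that the model is prompted to put its final answer in `\boxed{...}`; the function name `math_reward` and the binary 0/1 reward are hypothetical choices, not something the original article specifies.

```python
import re
from fractions import Fraction

# Assumed output convention: the final answer appears as \boxed{...}.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def math_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the last boxed answer matches the reference, else 0.0."""
    matches = BOXED.findall(completion)
    if not matches:
        return 0.0  # no parseable answer: no reward
    candidate = matches[-1].strip().replace(",", "")
    reference = reference.strip()
    try:
        # Compare numerically so that 0.5 and 1/2 count as the same answer.
        return 1.0 if Fraction(candidate) == Fraction(reference) else 0.0
    except (ValueError, ZeroDivisionError):
        # Non-numeric answers fall back to exact string comparison.
        return 1.0 if candidate == reference else 0.0
```

A strict parser like this is also a first line of defense against reward hacking: a completion that never commits to a final answer scores zero, however plausible its reasoning looks.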
Key insights
New post-training methods like GRPO, RLVR, and DAPO have replaced RLHF for reasoning tasks, improving scalability and efficiency.
Principles
- Eliminate human annotation bottlenecks.
- Reduce computational overhead in RL.
- Modularize post-training objectives.
Method
The new stack involves SFT for instruction following, DPO/SimPO for alignment, and GRPO+RLVR (with DAPO fixes) for reasoning, using programmatic verifiers for rewards.
In practice
- GRPO enables RL fine-tuning on consumer GPUs.
- Programmatic verifiers are now the main lever for shaping the reward.
- DAPO fixes GRPO instabilities for long CoT outputs.
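The DAPO fixes mentioned above can be sketched in plain Python: a dynamic-sampling filter that drops prompt groups with no reward variance, and a clipped surrogate loss that uses separate lower and upper clip bounds and averages over all tokens rather than per response. The function names are hypothetical, and the default bounds (`eps_low=0.2`, `eps_high=0.28`) are illustrative starting points rather than tuned values.

```python
import math


def keep_group(rewards):
    """Dynamic sampling: keep a prompt group only if its rewards differ.

    All-correct or all-wrong groups give zero advantages, hence no gradient.
    """
    return max(rewards) > min(rewards)


def token_level_clipped_loss(new_logps, old_logps, advantages,
                             eps_low=0.2, eps_high=0.28):
    """Clipped policy loss averaged over ALL tokens in the group.

    new_logps / old_logps: one list of per-token log-probs per response.
    advantages: one scalar advantage per response.
    """
    total, n_tokens = 0.0, 0
    for new_seq, old_seq, adv in zip(new_logps, old_logps, advantages):
        for lp_new, lp_old in zip(new_seq, old_seq):
            ratio = math.exp(lp_new - lp_old)
            # Decoupled clip: the upper bound is looser than the lower one.
            clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
            total += min(ratio * adv, clipped * adv)
            n_tokens += 1
    # Dividing by the token count (not the response count) means a long
    # response contributes in proportion to its length.
    return -total / max(n_tokens, 1)


# One 3-token response with advantage +1, one 1-token response with -1.
loss = token_level_clipped_loss(
    new_logps=[[0.0, 0.0, 0.0], [0.0]],
    old_logps=[[0.0, 0.0, 0.0], [0.0]],
    advantages=[1.0, -1.0],
)
```

With identical old and new log-probabilities the ratio is 1 everywhere, so the example loss is -(3 - 1) / 4 = -0.5. A per-response average would instead give 0, which shows how token-level averaging changes the weight of long outputs.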
Topics
- LLM Post-Training Stack
- Group Relative Policy Optimization
- Reinforcement Learning with Verifiable Rewards
- Decoupled Clip and Dynamic sAmpling Policy Optimization
- Programmatic Verifiers
Best for: AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards AI - Medium.