The Death of RLHF: A Practitioner’s Guide to the New Post-Training Stack

· Source: Towards AI - Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Advanced, long

Summary

The traditional post-training pipeline for large language models (LLMs), Supervised Fine-Tuning (SFT) followed by Reinforcement Learning from Human Feedback (RLHF), is now largely obsolete. Modern reasoning models such as DeepSeek-R1, Nemotron 3 Super, and Qwen3 use a modular stack that addresses three RLHF limitations: annotation bottlenecks, the high compute cost of a four-model setup (policy, reference, reward model, and critic), and reward model drift. Group Relative Policy Optimization (GRPO) eliminates the critic by computing advantages from the reward statistics of a group of sampled completions, which the article credits with making RL feasible on a single GPU. Reinforcement Learning with Verifiable Rewards (RLVR) replaces the learned reward model and human judgment with programmatic verifiers for tasks like math and code, giving scalable and consistent rewards. Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) further stabilizes GRPO for long chain-of-thought outputs by fixing length bias, entropy collapse, and vanishing gradients. The resulting stack separates instruction following (SFT), alignment (DPO/SimPO), and reasoning (GRPO+RLVR).
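
The group-statistics idea behind GRPO can be illustrated with a minimal sketch. It assumes one prompt, a handful of sampled completions, and a scalar reward per completion; the function name and the use of the population standard deviation are illustrative choices, not details taken from the article.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each completion relative to the other samples for the same prompt.

    No learned critic is involved: the group mean acts as the baseline and
    the group standard deviation rescales the result.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt, scored 1.0 (correct) or 0.0 (wrong).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every sample gets the same reward, all advantages are zero and the
# group contributes no gradient signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

The second call shows why uniform groups are a problem: with identical rewards there is nothing to learn from, which is the vanishing-gradient case that DAPO's dynamic sampling is meant to address.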

Key takeaway

For AI engineers fine-tuning LLMs for reasoning tasks, the shift from RLHF to GRPO+RLVR with DAPO fixes means advanced reasoning capabilities are now achievable on more modest hardware. Prioritize building robust programmatic verifiers for your domain-specific tasks, since verifier quality is now the primary driver of model improvement. Watch for reward hacking and for training instabilities on long chain-of-thought outputs, and configure training with a token-level loss and appropriate clipping bounds.
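
As a rough illustration of what "token-level loss and appropriate clipping bounds" can mean in practice, the sketch below averages a clipped policy-gradient objective over all tokens in a group rather than per completion, and uses separate lower and upper clip bounds. The function names are hypothetical, and the default values of `eps_low` and `eps_high` are illustrative assumptions that do not come from the article.

```python
import math


def clipped_token_objective(logp_new, logp_old, advantage,
                            eps_low=0.2, eps_high=0.28):
    """PPO-style clipped objective for one token, with decoupled bounds."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


def token_level_loss(group):
    """Average the objective over every token in the group.

    `group` is a list of (logp_new_tokens, logp_old_tokens, advantage)
    tuples, one per sampled completion. Dividing by the total token count
    gives each token equal weight, so long completions are not down-weighted
    the way they are when each completion is averaged separately.
    """
    total, n_tokens = 0.0, 0
    for logp_new, logp_old, advantage in group:
        for ln, lo in zip(logp_new, logp_old):
            total += clipped_token_objective(ln, lo, advantage)
            n_tokens += 1
    return -total / n_tokens


# A 3-token completion with advantage +1 and a 1-token completion with -1.
group = [
    ([-1.0, -1.0, -1.0], [-1.0, -1.0, -1.0], 1.0),
    ([-2.0], [-2.0], -1.0),
]
loss = token_level_loss(group)
```

With per-completion averaging the two completions in this toy group would cancel out exactly; with token-level averaging the longer completion carries three of the four tokens and dominates the result.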

Key insights

New post-training methods like GRPO, RLVR, and DAPO have replaced RLHF for reasoning tasks, improving scalability and efficiency.

Method

The new stack involves SFT for instruction following, DPO/SimPO for alignment, and GRPO+RLVR (with DAPO fixes) for reasoning, using programmatic verifiers for rewards.
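
A programmatic verifier of the kind this stack relies on can be very small. The sketch below is a hypothetical math verifier: it assumes the model is prompted to put its final answer in `\boxed{...}` and that the reference answer is a plain number or fraction. Both conventions are assumptions for illustration, not details from the article.

```python
import re
from fractions import Fraction

# Assumed answer format: the final answer appears as \boxed{...}.
ANSWER_PATTERN = re.compile(r"\\boxed\{([^{}]+)\}")


def math_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the last boxed answer equals the reference, else 0.0."""
    matches = ANSWER_PATTERN.findall(completion)
    if not matches:
        return 0.0  # no parseable answer means no reward
    try:
        predicted = Fraction(matches[-1].strip())
        expected = Fraction(reference.strip())
    except (ValueError, ZeroDivisionError):
        return 0.0  # malformed numbers are treated as wrong
    return 1.0 if predicted == expected else 0.0


reward_ok = math_reward("So the answer is \\boxed{0.5}", "1/2")
reward_missing = math_reward("I think it is one half.", "1/2")
reward_wrong = math_reward("\\boxed{3}", "4")
```

Comparing exact fractions instead of raw strings lets equivalent forms such as `0.5` and `1/2` earn the same reward. A verifier that is too lenient invites reward hacking, so the strictness of checks like this is a design decision worth testing against adversarial outputs.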

Best for: AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards AI - Medium.