Run High-Throughput Reinforcement Learning Training with End-to-End FP8 Precision
Summary
NVIDIA NeMo RL, an open-source library, addresses the challenges of integrating low-precision FP8 datatypes into reinforcement learning (RL) workloads for large language models. RL training alternates between a generation phase and a training phase, each with its own performance requirements. Using FP8 for the linear layers in both phases, termed "end-to-end FP8," improves throughput by more than 15% for dense models such as Llama 3.1 8B Instruct; this is well below the theoretical 2x speedup of FP8 math. Running both phases in FP8 also reduces numerical disagreement between the generation and training engines. Importance sampling mitigates the remaining disagreement and fully closes the accuracy gap to BF16 training. Extending FP8 to the KV cache and attention operations brings the overall rollout-stage speedup to roughly 48% for models such as Qwen3-8B-Base, with a calibration overhead of only 2-3%.
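The gap between the theoretical 2x and the observed gain of more than 15% is what Amdahl's law predicts when only part of the runtime is accelerated. The sketch below is illustrative arithmetic only: the source does not give a runtime breakdown, and the function names are hypothetical.

```python
def end_to_end_speedup(accelerated_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: overall speedup when only a share of the runtime gets faster."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be in [0, 1]")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    remaining = 1.0 - accelerated_fraction          # runtime that stays unchanged
    accelerated = accelerated_fraction / kernel_speedup  # runtime after speedup
    return 1.0 / (remaining + accelerated)


def fraction_for_speedup(target_speedup: float, kernel_speedup: float) -> float:
    """Invert Amdahl's law: share of runtime that must be accelerated."""
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / kernel_speedup)


# Assumed inputs: a 2x kernel speedup and a 1.15x end-to-end speedup.
share = fraction_for_speedup(target_speedup=1.15, kernel_speedup=2.0)
print(f"Share of runtime in accelerated kernels: {share:.1%}")
```

Under these assumptions, a 15% end-to-end gain corresponds to roughly a quarter of the step time being spent in kernels that actually run 2x faster. The rest (attention, communication, sampling, and other non-GEMM work) is unaffected.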
Key takeaway
For NLP Engineers and Research Scientists optimizing large language model RL training, NVIDIA NeMo RL's end-to-end FP8 recipe is worth adopting. Combined with importance sampling, FP8 linear layers raise throughput by more than 15% without losing accuracy relative to BF16, and adding FP8 for the KV cache and attention speeds up the rollout stage by roughly 48%. Explore the NeMo RL GitHub examples to implement these configurations and accelerate your RL workloads.
Key insights
End-to-end FP8 with importance sampling raises RL training throughput by more than 15% on dense models such as Llama 3.1 8B Instruct while matching BF16 accuracy.
Principles
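- RL pipelines benefit from low-precision datatypes like FP8 in both the generation and training phases.
- Numerical disagreement between the generation and training engines can be mitigated by using the same precision in both and applying importance sampling.
- Dynamic recalibration of the QKV scales is crucial when the KV cache is stored in FP8 during RL.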
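To illustrate why recalibration matters, here is a minimal sketch of per-tensor FP8 scaling. It assumes the E4M3 format, whose largest finite value is 448, and presumes that activation ranges drift as the policy weights are updated. The helper names are hypothetical and this is not NeMo RL's implementation.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def compute_scale(values, fp8_max=FP8_E4M3_MAX):
    """Per-tensor scale that maps the largest observed magnitude onto fp8_max."""
    amax = max(abs(v) for v in values)
    return amax / fp8_max if amax > 0.0 else 1.0


def clipped_fraction(values, scale, fp8_max=FP8_E4M3_MAX):
    """Share of values whose magnitude exceeds what the given scale can represent."""
    # Tiny relative tolerance so floating-point rounding in scale * fp8_max
    # does not count the calibration maximum itself as clipped.
    limit = scale * fp8_max * (1.0 + 1e-12)
    return sum(1 for v in values if abs(v) > limit) / len(values)


# Hypothetical KV activations before and after a policy update.
kv_before_update = [0.5, -1.0, 2.0, -4.0]
kv_after_update = [0.5, -3.0, 6.0, -8.0]

stale_scale = compute_scale(kv_before_update)
fresh_scale = compute_scale(kv_after_update)

print("clipped with stale scale:", clipped_fraction(kv_after_update, stale_scale))
print("clipped with fresh scale:", clipped_fraction(kv_after_update, fresh_scale))
```

A scale calibrated on the old policy saturates the larger activations of the new one, while a freshly computed scale covers the full range again. This is the motivation for recomputing scales as training progresses.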
Method
NVIDIA NeMo RL implements end-to-end FP8 for linear layers, KV cache, and attention, using importance sampling and dynamic QKV scale recalibration to optimize RL training performance and accuracy.
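The importance sampling correction can be sketched as follows. This is a generic truncated importance sampling example under stated assumptions, not NeMo RL's actual code: the function names, the `max_weight` cap, and the simple REINFORCE-style loss are all illustrative choices.

```python
import math


def truncated_importance_weights(train_logprobs, gen_logprobs, max_weight=2.0):
    """Per-token ratio pi_train / pi_gen, capped at max_weight to limit variance."""
    if len(train_logprobs) != len(gen_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    return [
        min(math.exp(lp_train - lp_gen), max_weight)
        for lp_train, lp_gen in zip(train_logprobs, gen_logprobs)
    ]


def weighted_policy_loss(train_logprobs, gen_logprobs, advantages, max_weight=2.0):
    """Mean of -w * A * log pi_train over tokens.

    In an autograd framework the weights w would be treated as constants
    (no gradient flows through them); here everything is plain floats.
    """
    if len(advantages) != len(train_logprobs):
        raise ValueError("advantages must match the number of tokens")
    weights = truncated_importance_weights(train_logprobs, gen_logprobs, max_weight)
    terms = [
        -w * adv * lp_train
        for w, adv, lp_train in zip(weights, advantages, train_logprobs)
    ]
    return sum(terms) / len(terms)


# Hypothetical log-probs of the same sampled tokens under each engine.
gen_lp = [-1.20, -0.70, -2.10]    # reported by the generation engine
train_lp = [-1.18, -0.74, -2.00]  # recomputed by the training engine
print(truncated_importance_weights(train_lp, gen_lp))
```

When the two engines agree exactly, every weight is 1 and the loss reduces to the uncorrected policy-gradient objective. The weights depart from 1 only where the engines' token probabilities differ.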
In practice
- Configure `precision: fp8` for linear layers in NeMo RL.
- Set `kv_cache_dtype: fp8` for KV cache and attention.
- Experiment with `num_first_layers_in_bf16` for mixed precision.
Topics
- Reinforcement Learning
- FP8 Precision Training
- NVIDIA NeMo RL
- Numerical Disagreement Mitigation
- Importance Sampling
Best for: NLP Engineer, Research Scientist, Machine Learning Engineer, AI Engineer, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.