Breaking Entropy Bounds: Accelerating RL Training via MTP with Rejection Sampling
Summary
Bebop presents a systematic study of Multi-Token Prediction (MTP) in Large Language Model (LLM) post-training and offers practical methods for integrating MTP into large-scale Reinforcement Learning (RL) pipelines. The study finds that MTP acceptance rates are fundamentally bounded by fluctuations in model entropy, with a negative linear relationship between acceptance rate and the entropy rise during the RL stage. It also shows that probabilistic rejection sampling suffers much less from this entropy disturbance than greedy draft sampling. Bebop further proposes an end-to-end TV loss that directly optimizes multi-step rejection sampling acceptance rates. The loss improves acceptance rates by approximately 10%, reaches acceptance rates of up to 95%, and adds up to 25% inference throughput across mathematical reasoning, code generation, and agentic tasks. Overall, the method achieves up to 1.8x end-to-end acceleration in async RL training for Qwen3.5, Qwen3.6, and Qwen3.7 models.
Key takeaway
AI engineers and machine learning scientists working to accelerate LLM Reinforcement Learning training should consider Bebop's approach. Adopting probabilistic rejection sampling and the end-to-end TV loss for MTP can yield up to 1.8x end-to-end acceleration and acceptance rates of up to 95%. The approach also allows the MTP module to be trained before RL, which streamlines the workflow and reduces online update costs.
Key insights
MTP acceptance in RL is limited by model entropy, but rejection sampling and a novel TV loss can significantly improve it.
Principles
- MTP acceptance rates correlate negatively with RL entropy.
- Probabilistic rejection sampling outperforms greedy draft sampling.
- Conventional MTP losses are suboptimal for RL settings.
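To make the second principle concrete, here is a minimal sketch of one draft-and-verify step using the standard speculative-decoding rejection rule. It is not code from the paper. The distributions `p` (target model) and `q` (MTP draft head) are made-up toy values, and the greedy baseline assumes one reading of "greedy draft sampling": the draft always proposes its argmax token and is treated as a one-hot distribution.

```python
import random

def acceptance_prob_rejection(p, q):
    """Expected acceptance probability when the draft token is sampled from q."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))

def acceptance_prob_greedy(p, q):
    """Acceptance probability when the draft always proposes argmax(q).

    Assumption: the draft is treated as one-hot on its top token, so the
    rejection rule accepts with probability p[top].
    """
    top = max(range(len(q)), key=q.__getitem__)
    return p[top]

def speculative_step(p, q, rng):
    """One lossless draft-and-verify step; returns (token, accepted)."""
    vocab = range(len(p))
    x = rng.choices(vocab, weights=q)[0]          # draft proposes x ~ q
    if rng.random() < min(1.0, p[x] / q[x]):      # accept with prob min(1, p/q)
        return x, True
    # On rejection, resample from the normalized residual max(0, p - q).
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(vocab, weights=residual)[0], False

# Toy distributions over a 3-token vocabulary (illustrative values only).
p = [0.5, 0.3, 0.2]   # target model
q = [0.3, 0.5, 0.2]   # draft head

print(acceptance_prob_rejection(p, q))
print(acceptance_prob_greedy(p, q))
```

In this toy case the sampled draft is accepted more often than the greedy one, because the target spreads its mass over several tokens. That is the kind of higher-entropy setting in which the summary says greedy drafting degrades.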
Method
Bebop proposes a novel end-to-end TV loss to optimize multi-step rejection sampling acceptance rates for MTP in RL training, eliminating the need for costly online MTP updating.
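The summary does not spell out the loss, so the following is only an illustrative sketch under stated assumptions, not the paper's formulation. It assumes "TV" means total variation distance and relies on the standard fact that the per-step acceptance probability of rejection sampling equals 1 - TV(p, q). The function names and the normalization in `e2e_tv_objective` are hypothetical, and the per-step distributions are toy values taken along a single prefix.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def expected_accepted_tokens(target_steps, draft_steps):
    """Expected number of accepted draft tokens over K chained draft steps.

    Step k only counts if every earlier step was accepted, so the
    per-step acceptance rates (1 - TV) are multiplied along the chain.
    """
    expected, survive = 0.0, 1.0
    for p, q in zip(target_steps, draft_steps):
        survive *= 1.0 - total_variation(p, q)
        expected += survive
    return expected

def e2e_tv_objective(target_steps, draft_steps):
    """Hypothetical objective: 0 when all K drafts are always accepted."""
    k = len(target_steps)
    return 1.0 - expected_accepted_tokens(target_steps, draft_steps) / k

# Two draft steps over a 3-token vocabulary (illustrative values only).
target = [[0.5, 0.3, 0.2], [0.6, 0.3, 0.1]]
draft = [[0.3, 0.5, 0.2], [0.5, 0.4, 0.1]]

print(expected_accepted_tokens(target, draft))
print(e2e_tv_objective(target, draft))
```

The point of chaining the steps is that an early mismatch discounts every later draft token, which a loss applied to each step independently would not capture. In real training the same quantity would be computed on differentiable model outputs in a tensor framework rather than on Python lists.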
In practice
- Integrate MTP with rejection sampling for RL.
- Use the end-to-end TV loss for MTP training.
- Pre-train MTP before RL to avoid online updates.
Topics
- Reinforcement Learning
- Multi-Token Prediction
- Speculative Decoding
- Large Language Models
- Rejection Sampling
- Model Entropy
- Qwen Models
Best for: Research Scientist, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.