Breaking Entropy Bounds: Accelerating RL Training via MTP with Rejection Sampling

2026-06-10 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Mathematics & Computational Sciences · Depth: Expert, quick

Summary

Bebop presents a systematic study of Multi-Token Prediction (MTP) in Large Language Model (LLM) post-training and offers practical methods for integrating MTP into large-scale Reinforcement Learning (RL) pipelines. The study finds that MTP acceptance rates are fundamentally bounded by fluctuations in model entropy: acceptance falls roughly linearly as entropy rises during the RL stage. Probabilistic rejection sampling significantly reduces this entropy-induced disturbance compared with greedy draft sampling. Bebop also proposes a novel end-to-end TV loss that directly optimizes the multi-step rejection sampling acceptance rate. Across mathematical reasoning, code generation, and agentic tasks, the loss improves acceptance rates by approximately 10%, reaches acceptance rates of up to 95%, and adds up to 25% inference throughput. The method delivers up to 1.8x end-to-end acceleration in asynchronous RL training for Qwen3.5, Qwen3.6, and Qwen3.7 models.

Key takeaway

AI Engineers and Machine Learning Scientists working to accelerate LLM Reinforcement Learning training should consider Bebop's approach. Adopting probabilistic rejection sampling and the end-to-end TV loss for MTP can yield up to 1.8x end-to-end acceleration and acceptance rates of up to 95%. Because the MTP module can be trained before RL begins, the approach also streamlines the workflow and reduces online update costs.
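
To give a feel for why acceptance rate matters for throughput, here is a minimal sketch of the standard speculative decoding estimate for tokens produced per target-model forward pass. It assumes every draft token is accepted independently with the same probability `alpha`, which is a simplification; the draft depth `k` and the 0.85 comparison value are illustrative choices, not figures from the paper. The function name is hypothetical.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target forward pass.

    Assumes k draft tokens per step, each accepted independently with
    probability alpha. The step always emits one extra token (the
    correction after a rejection, or a bonus token if all are accepted),
    so the result is the geometric sum 1 + alpha + ... + alpha**k.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    # Illustrative values only: k = 3 is an assumed draft depth.
    for alpha in (0.85, 0.95):
        tokens = expected_tokens_per_step(alpha, k=3)
        print(f"alpha={alpha:.2f}  tokens/step={tokens:.3f}")
```

This estimate ignores the cost of running the draft heads and any scheduling effects in async RL, so it should not be read as a reproduction of the reported 1.8x end-to-end figure.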

Key insights

MTP acceptance in RL is limited by model entropy, but rejection sampling and a novel TV loss can significantly improve it.

Principles

MTP acceptance rates fall roughly linearly as model entropy rises during RL.
Probabilistic rejection sampling outperforms greedy draft sampling.
Conventional MTP losses are suboptimal for RL settings.
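
The contrast between the two sampling schemes in the principles above can be sketched with the standard speculative sampling rule: a draft token x drawn from the draft distribution q is accepted with probability min(1, p[x]/q[x]) under the target distribution p, which gives an overall acceptance probability of 1 minus the total variation distance between p and q. The greedy variant below treats the draft as a point mass on its argmax, which is one assumed reading of "greedy draft sampling"; the summary does not spell out the paper's exact definition. All function names and the example distributions are hypothetical.

```python
import random


def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def rejection_accept_prob(p, q):
    """Acceptance probability when x ~ q is kept with prob min(1, p[x]/q[x])."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def greedy_accept_prob(p, q):
    """Acceptance probability when the draft always proposes argmax(q).

    Assumed reading: the draft is a point mass, so the proposal is
    accepted with probability p[argmax(q)].
    """
    draft_token = max(range(len(q)), key=lambda i: q[i])
    return p[draft_token]


def speculative_step(p, q, rng):
    """One step of speculative rejection sampling; returns (token, accepted)."""
    x = rng.choices(range(len(q)), weights=q)[0]
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    # On rejection, resample from the normalized residual max(p - q, 0).
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    y = rng.choices(range(len(p)), weights=residual)[0]
    return y, False


if __name__ == "__main__":
    # Hypothetical distributions: a fairly high-entropy target p
    # and a more peaked draft q.
    p = [0.4, 0.3, 0.2, 0.1]
    q = [0.7, 0.2, 0.05, 0.05]
    print("rejection sampling:", rejection_accept_prob(p, q))
    print("greedy draft:      ", greedy_accept_prob(p, q))
```

In this toy case the probabilistic scheme accepts more often than the greedy one because it credits the overlap between p and q across the whole vocabulary, not just at a single token. That gap tends to widen as the target distribution spreads out, which is consistent with the entropy sensitivity described above.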

Method

Bebop proposes a novel end-to-end TV loss that directly optimizes multi-step rejection sampling acceptance rates for MTP in RL training, eliminating the need for costly online MTP updates.
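
The summary does not give the loss formula, so the following is only a sketch of what an end-to-end objective over multi-step acceptance could look like, assuming "TV" refers to total variation distance. It uses the fact that the per-step acceptance probability under rejection sampling is 1 minus the TV distance, chains those probabilities across draft positions, and penalizes a short expected accepted length. The function names and the normalization are hypothetical and may differ from the paper's formulation. The per-position distributions are also treated as fixed inputs, whereas in a real model they depend on the accepted prefix.

```python
def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def expected_accepted_length(target_dists, draft_dists):
    """Expected number of consecutively accepted draft tokens.

    Position i is reached only if positions 1..i-1 were all accepted,
    so the expectation is sum_k prod_{i<=k} (1 - TV_i).
    """
    expected, survive = 0.0, 1.0
    for p, q in zip(target_dists, draft_dists):
        survive *= 1.0 - tv_distance(p, q)
        expected += survive
    return expected


def e2e_tv_loss(target_dists, draft_dists):
    """Hypothetical loss: 1 - (expected accepted length / draft depth).

    Returns 0 when every draft distribution matches its target and
    approaches 1 as early positions become unlikely to be accepted.
    """
    depth = len(draft_dists)
    if depth == 0:
        raise ValueError("need at least one draft position")
    return 1.0 - expected_accepted_length(target_dists, draft_dists) / depth


if __name__ == "__main__":
    # Hypothetical two-step example over a two-token vocabulary.
    target = [[0.5, 0.5], [0.9, 0.1]]
    draft = [[0.5, 0.5], [0.6, 0.4]]
    print("expected accepted length:", expected_accepted_length(target, draft))
    print("loss:", e2e_tv_loss(target, draft))
```

One design point this sketch makes visible: because acceptance probabilities multiply along the draft, a mismatch at an early position costs more than the same mismatch at a later one, which a per-position loss that weights every step equally would not capture.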

In practice

Integrate MTP with rejection sampling for RL.
Use the end-to-end TV loss for MTP training.
Pre-train MTP before RL to avoid online updates.

Topics

Reinforcement Learning
Multi-Token Prediction
Speculative Decoding
Large Language Models
Rejection Sampling
Model Entropy
Qwen Models

Best for: Research Scientist, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.