ConFu: Contemplate the Future for Better Speculative Sampling
Summary
ConFu (Contemplate the Future) is a speculative decoding framework that accelerates large language model (LLM) inference by improving draft model quality. Existing draft models, such as the EAGLE series, condition only on the current prefix, so their errors accumulate over successive draft steps. ConFu lets the draft model anticipate the future direction of generation through three innovations: (i) "contemplate tokens" and "soft prompts" that provide future-oriented signals from the target model at negligible cost, (ii) a dynamic contemplate token mechanism that uses Mixture-of-Experts (MoE) for context-aware future prediction, and (iii) a training framework with "anchor token sampling" and "future prediction replication" for robust learning. Experiments on SpecBench with Llama-3 3B and 8B models show that ConFu improves token acceptance rates and generation speed over EAGLE-3 by 8-11% across downstream tasks, sampling temperatures, and draft tree budgets.
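For readers unfamiliar with the "token acceptance rate" mentioned above, the following is a minimal sketch of the generic acceptance test from the speculative sampling literature, not ConFu-specific code. The function name `accept_or_resample` and the toy distributions are illustrative assumptions; a better draft model raises how often the first branch is taken.

```python
import random

def accept_or_resample(draft_token, p_target, q_draft, rng):
    """Generic speculative sampling test for one drafted token.

    p_target and q_draft are probability lists over the same vocabulary.
    Returns (token, accepted).
    """
    p = p_target[draft_token]
    q = q_draft[draft_token]
    # Accept the drafted token with probability min(1, p / q).
    if q > 0 and rng.random() < min(1.0, p / q):
        return draft_token, True
    # On rejection, resample from the residual max(0, p - q), renormalized.
    residual = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    if sum(residual) == 0.0:
        # Degenerate case (identical distributions): fall back to the target.
        residual = list(p_target)
    token = rng.choices(range(len(p_target)), weights=residual)[0]
    return token, False

rng = random.Random(0)
print(accept_or_resample(0, [0.5, 0.5], [0.5, 0.5], rng))  # draft matches target
print(accept_or_resample(0, [0.0, 1.0], [1.0, 0.0], rng))  # draft fully wrong
```

When the draft and target distributions agree, every drafted token is accepted; the more they diverge, the more often the target model has to resample, which is the cost that better drafts reduce.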
Key takeaway
For NLP engineers optimizing LLM inference, ConFu advances speculative decoding: by letting the draft model anticipate where generation is heading, it raises token acceptance rates and decoding speed by 8-11% over EAGLE-3. Consider integrating ConFu's future-aware signals and dynamic contemplate tokens to mitigate error accumulation and increase throughput in your LLM deployments. The reported results cover Llama-3 3B and 8B models.
Key insights
ConFu enhances speculative decoding by enabling draft models to anticipate future generation direction via target model "thought" signals.
Principles
- Future-aware signals improve draft model accuracy.
- Dynamic tokens adapt to diverse generation contexts.
- Robust training enhances future prediction stability.
Method
ConFu uses contemplate tokens and soft prompts to extract future-oriented signals from the target model. A dynamic MoE mechanism adapts these tokens, and a training framework with anchor token sampling and future prediction replication ensures robust learning.
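The dynamic MoE idea can be illustrated with a minimal sketch of soft routing: a router scores a set of expert embeddings from the current context, and the contemplate token is their weighted mix. This is an assumption-laden illustration, not the paper's implementation. This summary does not say whether ConFu uses soft or top-k routing or what the router takes as input, and the names `dynamic_contemplate_token`, `gate_weights`, and `expert_embeddings` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dynamic_contemplate_token(context, gate_weights, expert_embeddings):
    """Mix expert embeddings using gate probabilities computed from the context.

    context:           list[float], a context vector (assumed router input)
    gate_weights:      one row per expert, same length as context (hypothetical router)
    expert_embeddings: one candidate token embedding per expert
    """
    # Router score per expert: dot product of its gate row with the context.
    scores = [sum(w * x for w, x in zip(row, context)) for row in gate_weights]
    probs = softmax(scores)
    dim = len(expert_embeddings[0])
    # Soft mixture: weighted sum of the expert embeddings.
    token = [sum(p * emb[i] for p, emb in zip(probs, expert_embeddings))
             for i in range(dim)]
    return token, probs

gates = [[1.0, 0.0], [0.0, 1.0]]
experts = [[1.0, 0.0], [0.0, 1.0]]
token, probs = dynamic_contemplate_token([3.0, 0.0], gates, experts)
print(probs, token)  # the first expert dominates for this context
```

The point of the sketch is only that the contemplate token changes with the context instead of being a single fixed learned vector.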
In practice
- Integrate contemplate tokens for future prediction.
- Employ MoE for dynamic, context-aware signals.
- Use anchor token sampling to reduce training memory.
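As a rough illustration of the last point, anchor token sampling can be thought of as computing the future-prediction objective at only a sampled subset of positions rather than at every token. The sketch below assumes uniform sampling without replacement, which this summary does not confirm; `sample_anchor_positions` and `anchored_loss` are hypothetical helper names.

```python
import random

def sample_anchor_positions(seq_len, num_anchors, rng):
    """Pick a sorted subset of token positions to act as anchors.

    Uniform sampling without replacement is an assumption of this sketch.
    """
    k = min(num_anchors, seq_len)
    return sorted(rng.sample(range(seq_len), k))

def anchored_loss(per_token_losses, anchors):
    """Average a per-token loss over the anchor positions only."""
    if not anchors:
        raise ValueError("at least one anchor position is required")
    return sum(per_token_losses[i] for i in anchors) / len(anchors)

rng = random.Random(0)
anchors = sample_anchor_positions(seq_len=16, num_anchors=4, rng=rng)
losses = [0.1 * i for i in range(16)]  # toy per-token losses
print(anchors, anchored_loss(losses, anchors))
```

Because the extra future-prediction work is done at 4 of 16 positions here, the activations that must be kept for that objective shrink accordingly, which is the memory saving the bullet refers to.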
Topics
- Speculative Decoding
- Large Language Models
- Inference Acceleration
- Draft Models
- Mixture-of-Experts
Best for: NLP Engineer, AI Scientist, Research Scientist, AI Researcher, AI Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.