Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion [R]
Summary
Orthrus is a method for memory-efficient parallel token generation in large language models. It integrates a trainable diffusion attention module into each layer of a frozen autoregressive (AR) Transformer. In this dual-view diffusion approach, a diffusion head proposes 32 tokens in parallel, and the AR head then verifies them in a second pass, accepting the longest matching prefix. Because the frozen AR head checks every accepted token, the output distribution remains provably identical to the base model's; because both heads share a single KV cache, there is no separate drafter cache to maintain. Orthrus reports up to 7.8x Tokens Per Forward pass (TPF) and approximately 6x wall-clock speedup on MATH-500, while training only 16% of parameters on fewer than 1 billion tokens in 24 hours on 8x H200 GPUs. Unlike other diffusion LMs, it maintains the base model's accuracy. Unlike speculative decoding methods, it avoids the overhead of an external drafter and a separate cache, resulting in zero Time-To-First-Token (TTFT) penalty and O(1) KV overhead.
Key takeaway
For AI engineers optimizing LLM inference, Orthrus is an alternative to speculative decoding worth evaluating. It reports significant speedups (up to 7.8x TPF) while preserving the base model's exact accuracy and adding no TTFT penalty, which suits latency-sensitive applications. It is a candidate when the goal is higher throughput without compromising output quality or paying the memory overhead of a separate drafter model.
Key insights
Orthrus accelerates LLM inference by integrating a diffusion module into a frozen AR Transformer, enabling parallel token generation.
Principles
- Preserve base model accuracy.
- Share KV cache for efficiency.
- Verify parallel proposals sequentially.
Method
Inject a trainable diffusion attention module into each layer of a frozen AR Transformer. The diffusion head proposes tokens in parallel, and the AR head verifies them, accepting the longest valid prefix.
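The propose-then-verify step described above can be sketched in a few lines of Python. This is a minimal toy under stated assumptions, not the Orthrus implementation: `propose` and `verify` are hypothetical stand-ins for the diffusion head and the AR head, verification is greedy (exact token match), and the loop appends the verifier's own token at the first mismatch so that every step makes progress. The summary does not confirm that last detail.

```python
from typing import Callable, List, Tuple


def accept_longest_prefix(draft: List[int], targets: List[int]) -> List[int]:
    """Return the leading tokens of `draft` that match the verifier's `targets`."""
    accepted = []
    for proposed, expected in zip(draft, targets):
        if proposed != expected:
            break
        accepted.append(proposed)
    return accepted


def generate(
    prompt: List[int],
    propose: Callable[[List[int], int], List[int]],
    verify: Callable[[List[int], List[int]], List[int]],
    max_new_tokens: int,
    block_size: int = 32,  # the summary mentions 32 parallel tokens
) -> Tuple[List[int], int]:
    """Toy propose/verify loop. Returns (tokens, number of verify passes)."""
    tokens = list(prompt)
    passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        draft = propose(tokens, block_size)
        # `verify` must return len(draft) + 1 tokens: the verifier's greedy
        # choice at each draft position, plus one token past the draft.
        targets = verify(tokens, draft)
        passes += 1
        accepted = accept_longest_prefix(draft, targets)
        # Assumption: also keep the verifier's token at the first mismatch
        # (or past the end of a fully accepted draft) to guarantee progress.
        tokens.extend(accepted + [targets[len(accepted)]])
    return tokens[: len(prompt) + max_new_tokens], passes


# --- Hypothetical toy models, for illustration only ---
def ar_next(context: List[int]) -> int:
    """Toy 'base model': the next token is the last token plus one, mod 10."""
    return (context[-1] + 1) % 10


def toy_verify(context: List[int], draft: List[int]) -> List[int]:
    """Verifier targets; a real model would compute these in one parallel pass."""
    ctx, targets = list(context), []
    for i in range(len(draft) + 1):
        targets.append(ar_next(ctx))
        if i < len(draft):
            ctx.append(draft[i])
    return targets


def toy_propose(context: List[int], k: int) -> List[int]:
    """Toy drafter: correct for the first 3 positions, then repeats itself."""
    out, last = [], context[-1]
    for i in range(k):
        if i < 3:
            last = (last + 1) % 10
        out.append(last)
    return out


if __name__ == "__main__":
    print(generate([0], toy_propose, toy_verify, max_new_tokens=8, block_size=4))
```

Because only tokens that agree with the verifier are kept, the sequence matches what token-by-token greedy decoding with `ar_next` would produce, just in fewer passes. Matching a sampled output distribution rather than a greedy one would need a different acceptance rule, which this sketch does not cover.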
In practice
- Achieve up to 7.8x TPF on token generation.
- Maintain exact base model accuracy.
- Reduce KV cache overhead to O(1).
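To make the TPF figure concrete, the following sketch computes Tokens Per Forward pass from a list of per-step token counts. The function name, the `passes_per_step` parameter, and the sample numbers are illustrative assumptions; this summary does not state how the original work counts forward passes.

```python
from typing import Sequence


def tokens_per_forward(tokens_per_step: Sequence[int], passes_per_step: int = 1) -> float:
    """Average number of tokens emitted per forward pass.

    tokens_per_step: tokens kept at each decoding step (hypothetical data).
    passes_per_step: forward passes charged to each step (an assumption;
                     use 2 to count the propose and verify passes separately).
    """
    if not tokens_per_step or passes_per_step < 1:
        raise ValueError("need at least one step and one pass per step")
    return sum(tokens_per_step) / (len(tokens_per_step) * passes_per_step)


if __name__ == "__main__":
    steps = [8, 6, 10, 7]  # made-up acceptance counts, not measured results
    print(tokens_per_forward(steps))
    print(tokens_per_forward(steps, passes_per_step=2))
```

Plain autoregressive decoding emits one token per pass, so it scores 1.0 on this metric under either counting convention.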
Topics
- Orthrus
- Parallel Token Generation
- Dual-View Diffusion
- AR Transformer
- Speculative Decoding
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.