Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion [R]

2026-05-15 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Orthrus is a method for memory-efficient parallel token generation in large language models that integrates a trainable diffusion attention module into each layer of a frozen autoregressive (AR) Transformer. In this dual-view diffusion approach, a diffusion head proposes 32 tokens in parallel, and an AR head then verifies them in a second pass, accepting the longest matching prefix. Both heads share a single KV cache, and because every emitted token is checked by the AR head, the output distribution remains provably identical to that of the base model. Orthrus achieves up to 7.8x Tokens Per Forward pass (TPF) and approximately 6x wall-clock speedup on MATH-500, while training only 16% of parameters on fewer than 1 billion tokens in 24 hours on 8x H200 GPUs. Unlike other diffusion LMs, it maintains the base model's accuracy. Unlike speculative decoding methods, it needs no external drafter or separate cache, resulting in zero Time-To-First-Token (TTFT) penalty and O(1) KV overhead.

Key takeaway

For AI engineers optimizing LLM inference, Orthrus presents a compelling alternative to speculative decoding. It achieves significant speedups (up to 7.8x TPF) while preserving the base model's exact accuracy and adding no TTFT penalty, which makes it well suited to latency-sensitive applications. Consider integrating Orthrus to raise throughput without compromising output quality or incurring the memory overhead of a separate drafter model.

Key insights

Orthrus accelerates LLM inference by integrating a diffusion module into a frozen AR Transformer, enabling parallel token generation.

Principles

Preserve base model accuracy.
Share KV cache for efficiency.
Verify parallel proposals sequentially.

Method

Inject a trainable diffusion attention module into each layer of a frozen AR Transformer. The diffusion head proposes tokens in parallel, and the AR head verifies them in a second pass, accepting the longest prefix that matches its own predictions.
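
The propose-then-verify loop can be illustrated with a minimal sketch. It is not the Orthrus implementation: it assumes greedy decoding, replaces both heads with toy functions (`toy_ar_next` and `toy_propose` are hypothetical stand-ins), and checks the draft one position at a time, where a real model would score all drafted positions in a single forward pass. Emitting the AR head's own token at the first mismatch is borrowed from standard speculative decoding and is an assumption here.

```python
from typing import Callable, List, Sequence

Token = int
NextTokenFn = Callable[[Sequence[Token]], Token]
ProposeFn = Callable[[Sequence[Token]], List[Token]]


def verify_draft(prefix: Sequence[Token], draft: Sequence[Token],
                 ar_next: NextTokenFn) -> List[Token]:
    """Accept the longest draft prefix that matches greedy AR decoding.

    At the first mismatch, the AR token is emitted instead (assumption:
    standard speculative-decoding behaviour), so each call makes progress.
    """
    context = list(prefix)
    accepted: List[Token] = []
    for proposed in draft:
        expected = ar_next(context)
        if proposed != expected:
            accepted.append(expected)  # correction token from the AR view
            return accepted
        accepted.append(proposed)
        context.append(proposed)
    if not accepted:  # empty draft: fall back to one ordinary AR step
        accepted.append(ar_next(context))
    return accepted


def generate(prefix: Sequence[Token], propose: ProposeFn,
             ar_next: NextTokenFn, max_new_tokens: int) -> List[Token]:
    """Alternate parallel proposals and AR verification until enough tokens exist."""
    out = list(prefix)
    target = len(out) + max_new_tokens
    while len(out) < target:
        out.extend(verify_draft(out, propose(out), ar_next))
    return out[:target]


def toy_ar_next(context: Sequence[Token]) -> Token:
    """Hypothetical AR head: counts upward modulo 10."""
    return (context[-1] + 1) % 10


def toy_propose(context: Sequence[Token]) -> List[Token]:
    """Hypothetical draft head: first two tokens right, third always wrong."""
    base = context[-1]
    return [(base + 1) % 10, (base + 2) % 10, (base + 5) % 10, (base + 4) % 10]
```

Because every emitted token is either confirmed or produced by the AR function, `generate` returns the same sequence as plain greedy AR decoding; the draft only changes how many tokens are emitted per verification step.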

In practice

Achieve up to 7.8x TPF on token generation.
Maintain exact base model accuracy.
Reduce KV cache overhead to O(1).
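
To make the TPF figure concrete, the sketch below computes Tokens Per Forward pass as total emitted tokens divided by total forward passes, which is the literal reading of the name. How Orthrus counts passes per decoding step is not stated in this summary, so `passes_per_step` is a hypothetical parameter and the numbers in the usage note are illustrative, not measured.

```python
from typing import Sequence


def tokens_per_forward(tokens_per_step: Sequence[int],
                       passes_per_step: int = 1) -> float:
    """Average number of tokens emitted per forward pass.

    tokens_per_step: tokens emitted at each decoding step.
    passes_per_step: forward passes charged to each step (assumption).
    """
    total_passes = len(tokens_per_step) * passes_per_step
    if total_passes <= 0:
        raise ValueError("need at least one step and one pass per step")
    return sum(tokens_per_step) / total_passes
```

For example, three steps that emit 8, 6, and 10 tokens give `tokens_per_forward([8, 6, 10]) == 8.0` when each step is charged one pass, and `4.0` when each step is charged two. Plain autoregressive decoding emits one token per pass, so its TPF is 1.0.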

Topics

Orthrus
Parallel Token Generation
Dual-View Diffusion
AR Transformer
Speculative Decoding

Code references

chiennv2000/orthrus

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.