Zyphra Releases ZAYA1-8B-Diffusion-Preview: The First MoE Diffusion Model Converted From an Autoregressive LLM With Up to 7.7x Speedup

2026-05-15 · Source: Machine Learning ML & Generative AI News · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Advanced, quick

Summary

Zyphra has released ZAYA1-8B-Diffusion-Preview, presented as the first Mixture-of-Experts (MoE) diffusion model converted from an autoregressive Large Language Model (LLM). The model targets the memory-bandwidth bottleneck of LLM inference: it generates 16 tokens simultaneously while sharing a single KV-cache, which shifts decoding from memory-bound to compute-bound. It reaches a 4.6x speedup with a lossless sampler and up to 7.7x with a logit-mixing sampler, the latter at a minor cost in quality. ZAYA1-8B-Diffusion-Preview is also reported to be faster at inference than multi-token prediction (MTP) and EAGLE3, and is described as the first diffusion-LM trained on AMD hardware. It was built by applying the TiDAR recipe to an existing ZAYA1-8B checkpoint, which required 1.1T tokens of additional mid-training.
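
The memory-bound versus compute-bound claim can be illustrated with a back-of-envelope estimate. The sketch below is not from the original article. It assumes roughly 2 FLOPs per active parameter per token, weights read once per decoding step, and 2-byte (bf16) parameters, and it ignores KV-cache and activation traffic. The function name and defaults are illustrative. For an MoE model, tokens in the same step may route to different experts, so the real gain is likely smaller than this ideal figure.

```python
def arithmetic_intensity(tokens_per_step: int, bytes_per_param: float = 2.0) -> float:
    """Approximate FLOPs performed per byte of weight traffic in one decoding step.

    Assumptions (illustrative, not from the source article):
      * about 2 FLOPs (one multiply, one add) per active parameter per token
      * every active weight is read from memory exactly once per step
      * KV-cache and activation traffic are ignored
    """
    if tokens_per_step < 1:
        raise ValueError("tokens_per_step must be at least 1")
    flops_per_param = 2.0 * tokens_per_step
    return flops_per_param / bytes_per_param


# One token per step versus the 16 tokens per step cited in the summary.
single = arithmetic_intensity(1)    # 1.0 FLOP per byte
block = arithmetic_intensity(16)    # 16.0 FLOPs per byte
print(f"intensity grows {block / single:.0f}x when 16 tokens share one weight read")
```

Under these assumptions the work done per byte read grows linearly with the number of tokens produced per step, which is the direction of the shift toward compute-bound decoding that the summary describes. Whether a given GPU actually becomes compute-bound depends on its ratio of peak FLOPs to memory bandwidth.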

Key takeaway

For NLP engineers optimizing LLM inference, ZAYA1-8B-Diffusion-Preview shows that converting an existing autoregressive checkpoint to a diffusion model is a viable way around memory-bandwidth limits. If you need speedups on the order of the reported 4.6x to 7.7x, evaluate diffusion conversion for your own models, and weigh the lossless sampler against the faster logit-mixing sampler according to how much quality loss your application can tolerate.

Key insights

Converting an autoregressive LLM to a diffusion model can substantially increase inference speed, because many tokens are generated per step against a single shared KV-cache.

Principles

Shared KV-cache improves GPU utilization
Diffusion models can accelerate token generation

Method

The TiDAR recipe converts an existing autoregressive LLM checkpoint into a diffusion model without training from scratch, at the cost of additional mid-training (1.1T tokens for ZAYA1-8B).

In practice

Consider diffusion for LLM inference speedup
Evaluate logit-mixing for higher throughput
Explore AMD hardware for diffusion-LM training
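
When comparing samplers, a useful quantity is how many tokens each forward pass actually emits. The summary does not describe Zyphra's samplers, so the sketch below is not their method. It shows a generic greedy draft-and-verify rule of the kind used in speculative decoding, as one way a sampler can be lossless with respect to greedy autoregressive decoding. The function names and token lists are hypothetical, and logit-mixing is not shown because its details are not given in the summary.

```python
from typing import List, Sequence


def verify_block(draft: Sequence[int], verifier: Sequence[int]) -> List[int]:
    """Return the tokens emitted for one block under greedy draft-and-verify.

    draft[i]    : token proposed in parallel for position i (hypothetical input)
    verifier[i] : token the autoregressive pass would choose at position i,
                  given the accepted prefix and draft[:i] (hypothetical input)
    """
    if len(draft) != len(verifier):
        raise ValueError("draft and verifier must have the same length")
    emitted: List[int] = []
    for proposed, checked in zip(draft, verifier):
        # Always emit the verifier's choice, so the output matches what
        # greedy autoregressive decoding would have produced.
        emitted.append(checked)
        if proposed != checked:
            # Later positions were conditioned on a wrong draft token,
            # so their verifier outputs cannot be trusted; stop here.
            break
    return emitted


def mean_tokens_per_step(blocks: Sequence[Sequence[int]]) -> float:
    """Average number of tokens emitted per forward pass."""
    if not blocks:
        raise ValueError("need at least one block")
    return sum(len(block) for block in blocks) / len(blocks)


# Hypothetical token ids: the draft agrees with the verifier on the first two positions.
step = verify_block(draft=[5, 7, 9, 2], verifier=[5, 7, 3, 8])
print(step)  # [5, 7, 3]
```

Tracking `mean_tokens_per_step` alongside a task-quality metric for each sampler gives a direct view of the throughput and quality trade-off that the summary attributes to logit-mixing.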

Topics

ZAYA1-8B-Diffusion-Preview
MoE Diffusion Model
Autoregressive LLM
Inference Speedup
KV-Cache Optimization

Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.