Nemotron-TwoTower: Diffusion Language Modeling with Pretrained Autoregressive Context

2026-06-25 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, quick

Summary

Nemotron-TwoTower is a block-wise autoregressive diffusion language model. It targets a limitation of existing diffusion language models, which use a single network for both context representation and iterative denoising. The architecture splits these functions across two components: a frozen autoregressive context tower that processes clean tokens causally, and a trainable diffusion denoiser tower that uses bidirectional block attention and refines noisy blocks through cross-attention to the context. Built on Nemotron-3-Nano-30B-A3B, an open-weight 30B hybrid Mamba-Transformer MoE model, and trained on approximately 2.1 trillion tokens, Nemotron-TwoTower retains 98.7% of the autoregressive baseline's quality while delivering 2.42X higher wall-clock generation throughput. The code and model weights are publicly available.

Key takeaway

For machine learning engineers developing or deploying large language models, Nemotron-TwoTower offers a way to raise generation speed at a small quality cost. If inference throughput is the bottleneck for your autoregressive models, this two-tower diffusion architecture is worth evaluating: it reports 2.42X higher wall-clock generation throughput while retaining 98.7% of the autoregressive baseline's quality. The released code and model weights let you test that trade-off on your own use cases.
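
As a rough illustration of what the reported throughput figure implies for latency, the sketch below converts a throughput multiplier into generation time. Only the 2.42X figure comes from the summary; the baseline rate of 100 tokens per second and the 1,000-token response are hypothetical numbers chosen for the arithmetic, and the calculation assumes the speedup applies uniformly across the whole generation.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time needed to generate num_tokens at a steady throughput."""
    return num_tokens / tokens_per_second


REPORTED_SPEEDUP = 2.42   # throughput multiplier reported in the summary
baseline_rate = 100.0     # hypothetical baseline throughput (tokens/second)
num_tokens = 1000         # hypothetical response length

baseline_time = generation_seconds(num_tokens, baseline_rate)
two_tower_time = generation_seconds(num_tokens, baseline_rate * REPORTED_SPEEDUP)

# Fraction of the baseline time that remains after the speedup.
time_fraction = two_tower_time / baseline_time

print(f"baseline:  {baseline_time:.2f} s")
print(f"two-tower: {two_tower_time:.2f} s")
print(f"fraction of baseline time: {time_fraction:.3f}")
```

Under these assumptions the same response takes roughly 41% of the baseline time. Measured gains on a real deployment will depend on hardware, batch size, and sequence length, so treat this as back-of-the-envelope arithmetic only.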

Key insights

Decoupling context representation from denoising in a diffusion language model raises generation throughput (2.42X here) while retaining most of the autoregressive baseline's quality (98.7%).

Principles

Separate context processing from denoising.
Use a frozen autoregressive context tower.
Use a trainable diffusion denoiser tower.
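
The control flow these principles imply can be sketched with toy stand-ins. This is not the released implementation: the `ContextTower` and `Denoiser` classes, the `MASK` placeholder, the one-position-per-step refinement rule, and the `generate` signature are all hypothetical, chosen only to show that clean tokens pass through the context tower once while the denoiser is called repeatedly on each noisy block.

```python
import random

MASK = "<mask>"  # hypothetical placeholder for a not-yet-denoised position


class ContextTower:
    """Toy stand-in for the frozen autoregressive context tower."""

    def __init__(self):
        self.calls = 0
        self.cache = []  # stand-in for cached context states

    def extend(self, clean_tokens):
        # Each batch of clean tokens is processed once and then cached.
        self.calls += 1
        self.cache.extend(clean_tokens)
        return list(self.cache)


class Denoiser:
    """Toy stand-in for the trainable diffusion denoiser tower."""

    def __init__(self, seed=0):
        self.calls = 0
        self.rng = random.Random(seed)

    def step(self, noisy_block, context_states):
        # Toy refinement rule: resolve one masked position per call,
        # in arbitrary order, conditioned on the context length only.
        self.calls += 1
        block = list(noisy_block)
        masked = [i for i, tok in enumerate(block) if tok == MASK]
        if masked:
            i = self.rng.choice(masked)
            block[i] = f"tok{len(context_states) + i}"
        return block


def generate(prompt, num_blocks, block_size, steps_per_block):
    context, denoiser = ContextTower(), Denoiser()
    states = context.extend(prompt)
    output = []
    for _ in range(num_blocks):
        block = [MASK] * block_size
        for _ in range(steps_per_block):
            block = denoiser.step(block, states)
        output.extend(block)
        # The finished block becomes clean context for the next block.
        states = context.extend(block)
    return output, context.calls, denoiser.calls


tokens, context_calls, denoiser_calls = generate(["a", "b", "c"], 2, 4, 4)
print(tokens, context_calls, denoiser_calls)
```

In this toy run the context tower is invoked once for the prompt and once per finished block, while the denoiser is invoked `steps_per_block` times per block. The real model's denoising schedule and interfaces are not described in this summary and may differ.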

Method

Nemotron-TwoTower uses a frozen autoregressive context tower for clean token processing and a trainable diffusion denoiser tower with bidirectional block attention for refining noisy blocks via cross-attention.
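
The three attention patterns named above can be written out as boolean masks. The sketch below is a minimal illustration in plain Python, where `mask[i][j]` being `True` means position `i` may attend to position `j`. The function names are hypothetical, and the cross-attention rule (a noisy block sees only clean tokens from earlier blocks) is an assumption about the layout rather than a detail confirmed by the summary.

```python
def causal_mask(n):
    """Context tower: position i attends to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def block_bidirectional_mask(n, block_size):
    """Denoiser self-attention: full attention within each block."""
    return [[i // block_size == j // block_size for j in range(n)]
            for i in range(n)]


def cross_attention_mask(n, block_size):
    """Denoiser cross-attention (assumed layout): noisy position i in
    block b attends to clean context positions before block b starts."""
    return [[j < (i // block_size) * block_size for j in range(n)]
            for i in range(n)]


def render(mask):
    """Render a mask as rows of 1s and 0s for quick inspection."""
    return "\n".join("".join("1" if v else "0" for v in row) for row in mask)


print(render(causal_mask(4)))
print()
print(render(block_bidirectional_mask(4, 2)))
print()
print(render(cross_attention_mask(4, 2)))
```

With four positions and a block size of two, the first block has no earlier context in the cross-attention mask; in practice a prompt would normally supply it.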

In practice

Explore two-tower architectures for diffusion LMs.
Integrate Nemotron-TwoTower for faster generation.
Analyze performance of hybrid Mamba-Transformer models.

Topics

Diffusion Models
Language Models
Autoregressive Models
Nemotron-TwoTower
Model Architecture
Inference Optimization

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.