Nemotron-TwoTower: Diffusion Language Modeling with Pretrained Autoregressive Context
Summary
Nemotron-TwoTower is a block-wise autoregressive diffusion language model. Existing diffusion language models use a single network for both context representation and iterative denoising; Nemotron-TwoTower splits these functions across two components. A frozen autoregressive context tower processes clean tokens causally, and a trainable diffusion denoiser tower uses bidirectional block attention to refine noisy blocks while cross-attending to the context. The model is built on Nemotron-3-Nano-30B-A3B, an open-weight 30B hybrid Mamba-Transformer MoE model, and trained on approximately 2.1 trillion tokens. It retains 98.7% of the autoregressive baseline's quality while delivering 2.42x higher wall-clock generation throughput. The code and model weights are publicly available.
Key takeaway
For machine learning engineers who develop or deploy large language models, Nemotron-TwoTower trades a small amount of quality for a large gain in generation speed: 2.42x higher wall-clock generation throughput at 98.7% of the autoregressive baseline's quality. If inference throughput is the bottleneck for your autoregressive models, the two-tower diffusion architecture is worth evaluating. The released code and model weights let you test it against your own workloads.
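To make the throughput figure concrete, the short sketch below converts a throughput multiplier into wall-clock time for a fixed amount of generated text. It assumes "2.42x higher throughput" means the two-tower model produces 2.42 times as many tokens per second as the baseline; the 100-second baseline is a made-up number for illustration, not a measurement from the article.

```python
def time_at_speedup(baseline_seconds, throughput_multiplier):
    """Wall-clock time for the same number of tokens at a higher throughput.

    Time is inversely proportional to throughput, so a 2.42x multiplier
    divides the generation time by 2.42.
    """
    if throughput_multiplier <= 0:
        raise ValueError("throughput_multiplier must be positive")
    return baseline_seconds / throughput_multiplier


# Hypothetical workload: the baseline needs 100 seconds to generate it.
baseline = 100.0
faster = time_at_speedup(baseline, 2.42)
saved_fraction = 1.0 - faster / baseline
print(f"{faster:.1f} s instead of {baseline:.1f} s "
      f"({saved_fraction:.1%} less time)")
```

Under that reading, a 2.42x throughput gain cuts generation time by a little under 60 percent for the same output length.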
Key insights
Decoupling context representation from denoising raises wall-clock generation throughput by 2.42x while retaining 98.7% of the autoregressive baseline's quality.
Principles
- Separate context processing from denoising.
- Use a frozen autoregressive (AR) context tower.
- Train only the diffusion denoiser tower.
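The division of labor in these principles can be sketched as a toy generation loop. This is only an illustration of the control flow, not the released implementation: `context_tower` and `denoiser_step` are hypothetical stand-ins with fixed arithmetic rules instead of neural networks, and the use of a mask placeholder with a fixed number of positions revealed per step is an assumption about the noise process, which the summary does not specify.

```python
MASK = -1  # hypothetical placeholder for a position that is still noisy


def context_tower(clean_tokens):
    """Stand-in for the frozen AR tower: one causal state per clean token.

    The "state" is a running sum, so state i depends only on tokens 0..i.
    """
    states, total = [], 0
    for tok in clean_tokens:
        total += tok
        states.append(total)
    return states


def denoiser_step(block, context_states, n_reveal):
    """Stand-in for the trainable denoiser: fill up to n_reveal masked slots.

    A real denoiser would predict tokens using bidirectional attention over
    the block plus cross-attention to context_states; this toy uses a fixed
    rule based on the last context state and the position in the block.
    """
    summary = context_states[-1] if context_states else 0
    out = list(block)
    for pos, tok in enumerate(out):
        if n_reveal == 0:
            break
        if tok == MASK:
            out[pos] = (summary + pos) % 100
            n_reveal -= 1
    return out


def generate(prompt, n_blocks, block_size, n_steps):
    """Generate block by block: context on clean tokens, denoising per block."""
    tokens = list(prompt)
    per_step = -(-block_size // n_steps)  # ceiling division
    for _ in range(n_blocks):
        # The context tower only ever sees clean (finished) tokens.
        context_states = context_tower(tokens)
        block = [MASK] * block_size
        for _ in range(n_steps):
            block = denoiser_step(block, context_states, per_step)
        tokens.extend(block)  # the denoised block becomes clean context
    return tokens


result = generate([1, 2, 3], n_blocks=2, block_size=4, n_steps=2)
print(result)
```

The toy re-runs `context_tower` over the whole sequence for every block to keep the code short; a practical system would presumably cache the context states and extend them incrementally.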
Method
Nemotron-TwoTower pairs two towers. The frozen autoregressive context tower processes clean tokens causally. The trainable diffusion denoiser tower applies bidirectional attention within each noisy block and reads the context tower's output through cross-attention.
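The three attention patterns named here can be written down as boolean masks. The sketch below is one plausible reading, not the paper's exact masking scheme: it assumes a sequence split into fixed-size blocks, and it assumes that a noisy position in block k may cross-attend only to clean context from blocks before k. All function names are illustrative.

```python
def causal_mask(n):
    """Context tower self-attention: position i sees positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def block_self_mask(n, block_size):
    """Denoiser self-attention: bidirectional, but only within one block."""
    return [[i // block_size == j // block_size for j in range(n)]
            for i in range(n)]


def block_cross_mask(n, block_size):
    """Denoiser cross-attention to clean context (assumed rule).

    Noisy position i in block k sees clean position j only if j belongs
    to an earlier block, so a block never sees its own clean tokens.
    """
    return [[j // block_size < i // block_size for j in range(n)]
            for i in range(n)]


def render(mask):
    """Draw a mask as text: 'x' means attention is allowed, '.' means blocked."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


n, block_size = 4, 2
print("causal (context tower):")
print(render(causal_mask(n)))
print("block self-attention (denoiser):")
print(render(block_self_mask(n, block_size)))
print("cross-attention to context (denoiser):")
print(render(block_cross_mask(n, block_size)))
```

With four positions and a block size of two, the causal mask is lower-triangular, the self-attention mask is block-diagonal, and the cross-attention mask lets the second block see both clean tokens of the first block while the first block sees no context at all.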
In practice
- Explore two-tower architectures for diffusion LMs.
- Integrate Nemotron-TwoTower for faster generation.
- Analyze performance of hybrid Mamba-Transformer models.
Topics
- Diffusion Models
- Language Models
- Autoregressive Models
- Nemotron-TwoTower
- Model Architecture
- Inference Optimization
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.