Nemotron-TwoTower: Diffusion Language Modeling with Pretrained Autoregressive Context
Summary
Nemotron-TwoTower is a block-wise autoregressive diffusion language model that targets both generation speed and output quality. Single-network diffusion language models use one set of weights for both context representation and iterative denoising. Nemotron-TwoTower splits these roles across two towers: a frozen autoregressive context tower that processes clean tokens causally, and a trainable diffusion denoiser tower that uses bidirectional block attention and cross-attention to the context to refine noisy blocks. The model is built on Nemotron-3-Nano-30B-A3B, an open-weight 30B hybrid Mamba-Transformer Mixture-of-Experts model, and was trained on approximately 2.1 trillion tokens. It reaches 98.7% of its autoregressive baseline's quality while delivering 2.42x higher wall-clock generation throughput. Code and weights are publicly available.
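To make the two headline ratios concrete, the short Python sketch below converts them into a per-request time and a retained score. The baseline inputs (10 seconds, a score of 80.0) are hypothetical placeholders, not figures from the paper, and the helper names are made up for illustration. It also assumes the 98.7% figure can be applied to a single aggregate score, which the summary does not state.

```python
def time_at_speedup(baseline_seconds, speedup):
    """Wall-clock time for the same output when throughput is `speedup` times higher."""
    return baseline_seconds / speedup


def score_at_retention(baseline_score, retention):
    """Score left after keeping a `retention` fraction of the baseline score."""
    return baseline_score * retention


# Hypothetical baseline: a generation that takes 10 s and scores 80.0.
faster = time_at_speedup(10.0, 2.42)        # about 4.13 s
retained = score_at_retention(80.0, 0.987)  # about 78.96
print(f"{faster:.2f} s, score {retained:.2f}")
```

Under these assumed inputs, a 2.42x throughput gain cuts generation time to roughly 41% of the baseline, at a cost of about one point on an 80-point score.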
Key takeaway
For machine learning engineers evaluating generative model architectures, Nemotron-TwoTower is an option for high-throughput text generation. Its decoupled diffusion design retains 98.7% of the autoregressive baseline's quality while raising wall-clock generation throughput by 2.42x, which makes it worth considering when a project needs both quality and speed. The released code and weights allow this hybrid Mamba-Transformer MoE model to be tested in existing pipelines.
Key insights
Nemotron-TwoTower decouples context and denoising in diffusion language models for improved efficiency and quality.
Principles
- Decoupling model roles enhances capacity.
- Frozen AR context tower preserves quality.
- Bidirectional attention refines noisy blocks.
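The three attention patterns named in these principles can be sketched as boolean masks in plain Python. This is an illustration, not the released implementation: the function names are hypothetical, and the rule that a noisy block cross-attends only to clean tokens of earlier blocks is an assumption suggested by the block-wise autoregressive description, not something this summary spells out.

```python
def causal_mask(n):
    """Context tower: clean token i attends to clean tokens 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def block_bidirectional_mask(n, block_size):
    """Denoiser self-attention: noisy tokens attend to every token in their own block."""
    return [[i // block_size == j // block_size for j in range(n)] for i in range(n)]


def cross_attention_mask(n, block_size):
    """Denoiser cross-attention (assumed rule): a noisy token in block b
    attends to clean context positions that come before block b."""
    return [[k < (i // block_size) * block_size for k in range(n)] for i in range(n)]


def show(mask):
    """Render a mask as rows of 'x' (attend) and '.' (blocked)."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


if __name__ == "__main__":
    for name, mask in [
        ("causal", causal_mask(4)),
        ("block-bidirectional", block_bidirectional_mask(4, 2)),
        ("cross-attention", cross_attention_mask(4, 2)),
    ]:
        print(name)
        print(show(mask))
```

With four positions and a block size of two, the first block has no clean context to attend to, while both tokens of the second block see the two clean tokens of the first.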
Method
Nemotron-TwoTower pairs a frozen AR context tower, which processes clean tokens causally, with a trainable diffusion denoiser tower. The denoiser uses bidirectional block attention and cross-attention to refine noisy blocks.
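A toy Python loop can show how the two towers might divide the work at generation time. Everything here is a stand-in under stated assumptions: `encode_context`, `denoise_step`, the `MASK` id, and the fixed number of positions revealed per step are all hypothetical, and the toy "predictions" are just integers. The real sampling schedule and caching strategy are not described in this summary.

```python
MASK = -1  # hypothetical id marking a position that is still noisy


def encode_context(clean_tokens):
    """Stand-in for the frozen AR context tower: one causal pass over clean tokens.
    A real tower would return hidden states; the toy returns the tokens."""
    return list(clean_tokens)


def denoise_step(context, block, n_reveal):
    """Stand-in for the denoiser tower: fill up to n_reveal masked positions."""
    block = list(block)
    revealed = 0
    for i, tok in enumerate(block):
        if tok == MASK and revealed < n_reveal:
            block[i] = len(context) + i  # toy prediction, not a model output
            revealed += 1
    return block


def generate(prompt, n_blocks, block_size, steps_per_block):
    """Generate block by block: encode clean context once, then denoise the block."""
    tokens = list(prompt)
    context_passes = denoiser_passes = 0
    per_step = -(-block_size // steps_per_block)  # ceiling division
    for _ in range(n_blocks):
        # The toy re-encodes the full prefix; a real system would likely cache.
        context = encode_context(tokens)
        context_passes += 1
        block = [MASK] * block_size
        while MASK in block:
            block = denoise_step(context, block, per_step)
            denoiser_passes += 1
        tokens.extend(block)  # the finished block becomes clean context
    return tokens, context_passes, denoiser_passes


if __name__ == "__main__":
    print(generate([0, 1, 2], n_blocks=2, block_size=4, steps_per_block=2))
```

In this toy run, eight new tokens cost four denoiser passes and two context passes, whereas token-by-token decoding would need eight sequential steps. Producing several tokens per pass is the general source of the throughput gain in block-wise decoding, though the toy counts say nothing about the measured 2.42x figure.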
In practice
- Use Nemotron-TwoTower for faster text generation.
- Explore hybrid MoE architectures.
- Adapt diffusion models from AR baselines.
Topics
- Diffusion Language Models
- Autoregressive Models
- Nemotron-TwoTower
- Mamba-Transformer MoE
- Text Generation
- Model Architecture
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.