Nemotron-TwoTower: Diffusion Language Modeling with Pretrained Autoregressive Context

2026-06-25 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, medium

Summary

Nemotron-TwoTower is a block-wise autoregressive diffusion language model that aims to improve both the efficiency and the quality of diffusion language models. Single-network approaches make one network handle both context representation and iterative denoising. Nemotron-TwoTower splits these roles across two components: a frozen autoregressive context tower that causally processes clean tokens, and a trainable diffusion denoiser tower that uses bidirectional block attention to refine noisy blocks, drawing on the context through cross-attention. The architecture is built on Nemotron-3-Nano-30B-A3B, an open-weight 30B hybrid Mamba-Transformer Mixture-of-Experts model, and was trained on approximately 2.1 trillion tokens. The model retains 98.7% of its autoregressive baseline's quality while delivering 2.42x higher wall-clock generation throughput. Code and weights are publicly available.
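The block-wise generation pattern described in the summary can be illustrated with a toy loop. This is a minimal sketch, not the released implementation: `encode_context`, `denoise_step`, and `generate` are hypothetical stand-ins, noise is modeled as masked positions (an assumption, since the summary only says "noisy blocks"), and the "tokens" are dummy integers rather than model predictions.

```python
MASK = None  # placeholder for a position that has not been denoised yet


def encode_context(clean_tokens):
    """Hypothetical stand-in for the frozen AR context tower.

    It only ever sees clean tokens; here it just returns a copy.
    """
    return list(clean_tokens)


def denoise_step(context, block):
    """Hypothetical stand-in for one pass of the denoiser tower.

    Fills about half of the still-masked positions in parallel.
    The filled value is a dummy token (the absolute position).
    """
    masked = [i for i, tok in enumerate(block) if tok is MASK]
    to_fill = masked[: (len(masked) + 1) // 2]
    refined = list(block)
    for i in to_fill:
        refined[i] = len(context) + i
    return refined


def generate(prompt, num_blocks, block_size):
    """Generate block by block; refine each block iteratively."""
    tokens = list(prompt)
    denoiser_calls = 0
    for _ in range(num_blocks):
        context = encode_context(tokens)   # clean tokens only
        block = [MASK] * block_size        # start from a fully noised block
        while any(tok is MASK for tok in block):
            block = denoise_step(context, block)
            denoiser_calls += 1
        tokens.extend(block)               # the finished block becomes clean context
    return tokens, denoiser_calls


tokens, calls = generate(prompt=[0, 1], num_blocks=2, block_size=4)
print(tokens, calls)
```

In this toy run, eight new tokens are produced with six denoiser passes, because each pass fills several positions at once. That is the general mechanism by which block-wise diffusion can raise throughput; the actual step counts and schedule of Nemotron-TwoTower are not given in this summary.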

Key takeaway

For machine learning engineers evaluating generative model architectures, Nemotron-TwoTower is an option for high-throughput text generation. Its decoupled diffusion approach is worth considering when a project needs both quality and speed: the model retains 98.7% of its autoregressive baseline's quality while raising wall-clock generation throughput by 2.42x, which may suit latency-sensitive applications. The released code and weights make it possible to evaluate this hybrid Mamba-Transformer MoE model in your own pipelines.

Key insights

Nemotron-TwoTower decouples context and denoising in diffusion language models for improved efficiency and quality.

Principles

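Separating context representation from iterative denoising avoids forcing one network to do both.
Keeping the AR context tower frozen preserves the quality of the pretrained model.
Bidirectional block attention lets the denoiser refine the tokens of a noisy block jointly.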

Method

Nemotron-TwoTower uses a frozen AR context tower for clean tokens and a trainable diffusion denoiser tower with bidirectional block attention for noisy block refinement via cross-attention.
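The three attention patterns named above (causal, bidirectional within a block, and cross-attention to clean context) can be written as boolean masks. This is a sketch under stated assumptions: the function names are hypothetical, the rule that a noisy block sees only clean tokens from earlier blocks is a common block-diffusion convention rather than something this summary confirms, and the real model is a hybrid Mamba-Transformer, so masks like these would apply only to its attention layers.

```python
def causal_mask(n):
    """Context tower: position i attends to positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def block_self_mask(n, block_size):
    """Denoiser self-attention: noisy position i attends to every
    noisy position j in the same block (bidirectional within the block)."""
    return [[i // block_size == j // block_size for j in range(n)]
            for i in range(n)]


def block_cross_mask(n, block_size):
    """Denoiser cross-attention (assumed rule): noisy position i attends
    to clean context position j only if j lies in an earlier block."""
    return [[j // block_size < i // block_size for j in range(n)]
            for i in range(n)]


def show(mask):
    """Render a mask as rows of 1/0 for quick inspection."""
    return ["".join("1" if cell else "0" for cell in row) for row in mask]


for name, mask in [
    ("causal", causal_mask(4)),
    ("block self", block_self_mask(4, 2)),
    ("block cross", block_cross_mask(4, 2)),
]:
    print(name, show(mask))
```

With four positions and a block size of two, the first block has no clean context to cross-attend to, while the second block sees both tokens of the first.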

In practice

Use Nemotron-TwoTower for faster text generation.
Explore hybrid MoE architectures.
Adapt diffusion models from AR baselines.

Topics

Diffusion Language Models
Autoregressive Models
Nemotron-TwoTower
Mamba-Transformer MoE
Text Generation
Model Architecture

Code references

HKUNLP/DiffuLLaMA

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.