End-to-End Training for Unified Tokenization and Latent Denoising

2026-03-23 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision & Pattern Recognition · Depth: Advanced, quick

Summary

UNITE is an autoencoder architecture for unified tokenization and latent diffusion that replaces the multi-stage training process typically required for latent diffusion models (LDMs). Its Generative Encoder serves as both image tokenizer and latent generator through weight sharing. The core idea is that tokenization and generation are fundamentally the same latent inference problem, differing only in their conditioning. UNITE trains in a single stage, jointly optimizing both tasks via two forward passes through the shared Generative Encoder so that gradients from both objectives shape a "common latent language." The approach achieves near state-of-the-art performance across image and molecule modalities. On ImageNet 256x256 it reaches FID scores of 2.12 and 1.73 for the Base and Large models, respectively, without relying on adversarial losses or pretrained encoders like DINO.

Key takeaway

For research scientists developing latent diffusion models, UNITE demonstrates that single-stage joint training of tokenization and generation is feasible and effective. This approach simplifies the LDM training pipeline and may reduce development time and compute cost, because it removes both the separate tokenizer pre-training stage and the adversarial losses. Consider adopting this unified architecture to streamline your LDM development while keeping performance competitive.

Key insights

Tokenization and latent diffusion can be jointly optimized in a single training stage.

Principles

Tokenization and generation are the same latent inference problem, differing only in their conditioning.
Weight sharing enables joint optimization of related tasks.

Method

UNITE uses a Generative Encoder for both tokenization and latent generation, optimizing both tasks in a single stage via two forward passes to create a shared latent space.
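
The two-pass idea can be illustrated with a toy, forward-only sketch. It is not the authors' implementation: the linear layers, the zero-masking of unused inputs, the linear noise interpolation, the unweighted sum of losses, and all names (GenerativeEncoder, joint_loss, decoder_weights) are assumptions made for illustration. It only shows one set of encoder weights being called twice, once conditioned on the image and once on a noised latent, with both losses combined into a single objective.

```python
import random


def linear(weights, vec):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def init_matrix(rows, cols, rng):
    """Small random weight matrix."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class GenerativeEncoder:
    """Toy shared network: one weight matrix serves both passes."""

    def __init__(self, image_dim, latent_dim, rng):
        self.image_dim = image_dim
        self.latent_dim = latent_dim
        # Input layout (assumed): [image ; noisy latent ; noise level t].
        self.weights = init_matrix(latent_dim, image_dim + latent_dim + 1, rng)

    def forward(self, image=None, noisy_latent=None, t=0.0):
        # Whichever conditioning input is absent is replaced by zeros.
        img = image if image is not None else [0.0] * self.image_dim
        lat = noisy_latent if noisy_latent is not None else [0.0] * self.latent_dim
        return linear(self.weights, img + lat + [t])


def joint_loss(encoder, decoder_weights, image, rng):
    # Pass 1 (tokenization): infer the latent conditioned on the image.
    latent = encoder.forward(image=image)
    recon_loss = mse(linear(decoder_weights, latent), image)

    # Pass 2 (generation): infer the latent conditioned on a noised copy.
    t = rng.random()
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    noisy = [(1.0 - t) * z + t * n for z, n in zip(latent, noise)]
    denoise_loss = mse(encoder.forward(noisy_latent=noisy, t=t), latent)

    # Single-stage objective: both terms depend on the same encoder weights.
    return recon_loss + denoise_loss, recon_loss, denoise_loss


rng = random.Random(0)
encoder = GenerativeEncoder(image_dim=8, latent_dim=4, rng=rng)
decoder_weights = init_matrix(8, 4, rng)
image = [rng.random() for _ in range(8)]
total, recon_loss, denoise_loss = joint_loss(encoder, decoder_weights, image, rng)
print(f"total={total:.4f} recon={recon_loss:.4f} denoise={denoise_loss:.4f}")
```

In an autodiff framework, backpropagating the summed loss would send gradients from both passes into the same encoder weights, which is the sense in which the two tasks shape a shared latent space.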

In practice

Train LDMs end-to-end without separate tokenizer pre-training.
Apply UNITE to image and molecule synthesis tasks.

Topics

Latent Diffusion Models
Image Tokenization
Autoencoder Architectures
Generative Models
Single-Stage Training

Best for: Computer Vision Engineer, Research Scientist, AI Researcher, AI Scientist, Deep Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.