Variable-Length Tokenization via Learnable Global Merging for Diffusion Transformers

2026-06-18 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision & Pattern Recognition · Depth: Expert, quick

Summary

This work proposes a variable-length tokenizer for Diffusion Transformers that addresses the fixed compression ratio of Latent Diffusion Models. Conventional variable-length tokenizers (VLTs) shorten a sequence by truncating an ordered token sequence, which makes token semantics position-dependent and misaligns latent distributions across lengths. The new approach instead modulates token length by merging similar tokens, which directly enables cross-length representation alignment when the diffusion transformer operates according to the merging pattern. To keep merging compatible with generation, the method introduces learnable global merging, a data-independent technique. Integrated with a diffusion transformer, the tokenizer achieves a better gFID-compute trade-off on ImageNet 256x256 generation than prior VLT methods. Code is available at movinghoon/lgm.

Key takeaway

For Machine Learning Engineers optimizing Latent Diffusion Models for visual synthesis, this variable-length tokenizer offers a better quality-compute trade-off than prior VLTs. Consider integrating merging-based tokenization, specifically learnable global merging, into your diffusion transformer architecture. Because merging resolves the cross-length representation misalignment caused by truncation, one model can trade quality against computational cost by changing the token length.

Key insights

Merging similar tokens enables variable-length tokenization while preserving cross-length representation alignment in Diffusion Transformers.

Principles

Merging similar tokens ensures cross-length alignment.
Data-independent merging is crucial for generation.
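
The second principle can be illustrated with a toy example. The sketch below is not the authors' code: the helper names (most_similar_pair, merge_pair, FIXED_PAIR) and the pairwise cosine-similarity rule are assumptions made purely for illustration. It shows that a merging rule computed from the tokens themselves picks a different pattern for each input, whereas a fixed pattern is known in advance, which matters at generation time because there are no encoded tokens to compute similarities from.

```python
import numpy as np

def most_similar_pair(tokens):
    """Data-dependent rule (illustrative): index pair with highest cosine similarity."""
    unit = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # ignore self-similarity
    i, j = np.unravel_index(np.argmax(sim), sim.shape)
    return tuple(sorted((int(i), int(j))))

def merge_pair(tokens, pair):
    """Average the two tokens in `pair` and keep the rest, giving one fewer token."""
    i, j = pair
    merged = (tokens[i] + tokens[j]) / 2
    rest = np.delete(tokens, [i, j], axis=0)
    return np.vstack([merged, rest])

# Two toy "images", each with three 2-D tokens.
x1 = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
x2 = np.array([[1.0, 0.0], [0.0, 1.0], [0.1, 0.9]])

# The data-dependent pattern changes with the input.
pair_x1 = most_similar_pair(x1)
pair_x2 = most_similar_pair(x2)

# A data-independent pattern (hypothetical, fixed here by hand) is the same
# for every input, so it can be applied before any image exists.
FIXED_PAIR = (0, 1)
short_x1 = merge_pair(x1, FIXED_PAIR)
short_x2 = merge_pair(x2, FIXED_PAIR)
```

In this toy case the data-dependent rule selects tokens 0 and 1 for the first input and tokens 1 and 2 for the second, so a generator could not know the pattern without first seeing the image.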

Method

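Modulate token length by merging similar tokens with learnable global merging. Because the merging pattern is data-independent, it does not depend on the image being encoded, so the diffusion transformer can operate according to the same pattern during generation, which keeps representations aligned across token lengths.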
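
A minimal sketch of what a data-independent, learnable merging pattern could look like follows. The summary does not describe the actual parameterization, so everything here is an assumption: the GlobalMerge class, the softmax-normalized assignment matrix, and the use of one matrix per target length are hypothetical choices, and NumPy is used only to show shapes (a real implementation would train the logits with an autograd framework).

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class GlobalMerge:
    """Hypothetical global merging: one learnable (length x num_tokens)
    assignment per target length, shared by all inputs."""

    def __init__(self, num_tokens, lengths, seed=0):
        rng = np.random.default_rng(seed)
        # Stand-in for trained parameters; random here for illustration.
        self.logits = {m: rng.normal(size=(m, num_tokens)) for m in lengths}

    def pattern(self, length):
        # Each row is a convex combination over the input tokens.
        return softmax(self.logits[length], axis=1)

    def merge(self, tokens, length):
        # tokens: (num_tokens, dim) -> (length, dim)
        return self.pattern(length) @ tokens

merger = GlobalMerge(num_tokens=16, lengths=(4, 8))
image_a = np.random.default_rng(1).normal(size=(16, 32))
image_b = np.random.default_rng(2).normal(size=(16, 32))

short_a = merger.merge(image_a, 4)   # 16 tokens -> 4 tokens
medium_b = merger.merge(image_b, 8)  # 16 tokens -> 8 tokens
```

The pattern returned by merger.pattern(4) is the same regardless of which image is passed to merge, which is the property the method relies on at generation time.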

Topics

Variable-Length Tokenization
Diffusion Transformers
Latent Diffusion Models
Learnable Global Merging
Visual Synthesis
Image Generation

Code references

movinghoon/lgm

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.