Variable-Length Tokenization via Learnable Global Merging for Diffusion Transformers
Summary
This work proposes a variable-length tokenizer for Diffusion Transformers that addresses the fixed compression ratio of Latent Diffusion Models. Conventional variable-length tokenizers (VLTs) truncate an ordered token sequence, which makes token semantics depend on position and leaves latent distributions misaligned across lengths. The new approach instead modulates token length by merging similar tokens, which directly aligns representations across lengths when the diffusion transformer operates according to the same merging pattern. To ensure compatibility during generation, the method introduces learnable global merging, a data-independent technique. Integrated with a diffusion transformer, the tokenizer achieves a better gFID-compute trade-off on ImageNet 256x256 generation than prior VLT methods. Code for this work is available.
Key takeaway
For Machine Learning Engineers optimizing Latent Diffusion Models for visual synthesis, this new variable-length tokenizer offers a way to achieve better quality-compute trade-offs. You should consider integrating merging-based tokenization, specifically learnable global merging, into your diffusion transformer architectures. This approach resolves cross-length representation misalignment, allowing your models to adaptively balance quality and computational cost more effectively than traditional VLTs.
Key insights
Merging similar tokens enables variable-length tokenization while preserving cross-length representation alignment in Diffusion Transformers.
Principles
- Merging similar tokens, rather than truncating an ordered sequence, enables cross-length alignment.
- Data-independent merging keeps the tokenizer compatible with the diffusion transformer during generation.
Method
Modulate token length by merging similar tokens using learnable global merging. Because the merging is data-independent, it stays compatible with the diffusion transformer during generation, and the transformer can operate according to the same merging pattern to keep representations aligned across lengths.
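To make the idea of a data-independent merging pattern concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names (`hard_merge_matrix`, `merge`, `unmerge`), the hard argmax assignment, the mean as the merge operator, and the copy-back unmerge are all assumptions chosen for illustration. The assignment logits are set by hand here, whereas in a learnable version they would be trained parameters (which would also need a differentiable relaxation of the argmax).

```python
import numpy as np

def hard_merge_matrix(logits):
    """Turn (M, N) assignment logits into a row-normalized averaging matrix.

    Each of the N source tokens goes to the merged slot with the highest
    logit. The logits do not depend on the input, so the pattern is the
    same for every image (data-independent). Hypothetical helper.
    """
    num_merged, num_tokens = logits.shape
    assignment = np.argmax(logits, axis=0)          # (N,) slot index per token
    one_hot = np.zeros((num_merged, num_tokens))
    one_hot[assignment, np.arange(num_tokens)] = 1.0
    counts = one_hot.sum(axis=1, keepdims=True)     # tokens per merged slot
    # Guard against empty slots so the division is always defined.
    return one_hot / np.maximum(counts, 1.0), assignment

def merge(tokens, merge_matrix):
    """Average tokens within each slot: (B, N, D) -> (B, M, D)."""
    return np.einsum("mn,bnd->bmd", merge_matrix, tokens)

def unmerge(merged, assignment):
    """Copy each merged token back to its source positions: (B, M, D) -> (B, N, D)."""
    return merged[:, assignment, :]

# Hand-set logits that pair adjacent tokens: 8 tokens -> 4 merged tokens.
# Assumption: in a learnable setup these would be trained, not fixed.
logits = np.kron(np.eye(4), np.ones((1, 2)))
merge_matrix, assignment = hard_merge_matrix(logits)

tokens = np.random.default_rng(0).normal(size=(2, 8, 16))  # (batch, tokens, dim)
short = merge(tokens, merge_matrix)      # shorter sequence for cheaper compute
restored = unmerge(short, assignment)    # back on the full-length token grid
print(short.shape, restored.shape)
```

Because `assignment` is fixed ahead of time, a generator that never sees an input image can still follow the same pattern, which is the property the method relies on. A truncation-based tokenizer, by contrast, would keep only the first few tokens of an ordered sequence.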
Topics
- Variable-Length Tokenization
- Diffusion Transformers
- Latent Diffusion Models
- Learnable Global Merging
- Visual Synthesis
- Image Generation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.