Balancing Image Compression and Generation with Bootstrapped Tokenization

2026-06-06 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, extended

Summary

SelfBootTok is a 1D image tokenizer that splits visual information into two token groups, global and local. Developed by researchers from Peking University and Huawei, it uses a self-bootstrapped learning paradigm in which local details are predicted exclusively from global tokens, shifting the burden of fine-grained visual information from the generator to the tokenizer. The generator becomes cheaper as a result: computation drops by approximately 40% and training time by about 54%. On ImageNet-256, SelfBootTok reports a new state-of-the-art gFID of 1.56 using only 64 tokens, with improved reconstruction and generation performance, and it scales efficiently because local aligners and generators can be optimized in parallel.

Key takeaway

Machine learning engineers building efficient image generation systems should investigate SelfBootTok's global-local token decomposition. Because local details are predicted from global tokens inside the tokenizer, the generator can focus on high-level semantics, which cuts computation by approximately 40% and training time by about 54%. The self-bootstrapped paradigm is worth considering when the goal is state-of-the-art generation quality with fewer tokens and more scalable, resource-efficient model development.

Key insights

SelfBootTok decomposes image tokens into global and local groups, predicting local details from global tokens for efficient generation and scalable training.

Principles

Decompose visual information into global and local token groups to reduce redundancy.
Self-bootstrapped learning shifts detail prediction burden from generator to tokenizer.
Parallel optimization of scaled local aligners and generators enhances training efficiency.

Method

SelfBootTok first encodes the image into global tokens. From these it predicts local 1D and 2D tokens using MLP and Transformer predictors, aligns the 2D tokens to the 1D tokens with optimal transport, and then soft-quantizes, fuses, and decodes all tokens.
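
The optimal-transport alignment and soft quantization steps can be illustrated with a small sketch. This summary does not specify the solver, cost function, or quantizer that SelfBootTok uses, so everything below is an assumption: entropic optimal transport via Sinkhorn iterations with a squared-Euclidean cost and uniform marginals, a barycentric projection that maps 2D tokens onto 1D slots, and a softmax-over-distances soft quantizer. All names (`sinkhorn_plan`, `align_2d_to_1d`, `soft_quantize`) and the toy values are hypothetical.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def sinkhorn_plan(src, dst, eps=0.1, iters=200):
    """Entropic OT plan between uniform marginals over src and dst (assumed setup)."""
    n, m = len(src), len(dst)
    # Gibbs kernel K = exp(-cost / eps); small eps gives a sharper plan.
    K = [[math.exp(-sq_dist(s, d) / eps) for d in dst] for s in src]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternate scaling so rows sum to 1/n and columns sum to 1/m.
        u = [(1.0 / n) / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def align_2d_to_1d(tokens_2d, tokens_1d, eps=0.1):
    """Barycentric projection: each 1D slot receives a plan-weighted mean of 2D tokens."""
    plan = sinkhorn_plan(tokens_2d, tokens_1d, eps)
    n, dim = len(tokens_2d), len(tokens_2d[0])
    aligned = []
    for j in range(len(tokens_1d)):
        mass = sum(plan[i][j] for i in range(n))
        aligned.append(
            [sum(plan[i][j] * tokens_2d[i][k] for i in range(n)) / mass
             for k in range(dim)]
        )
    return aligned, plan

def soft_quantize(vec, codebook, temperature=0.05):
    """Softmax over negative distances, then a weighted mix of codewords."""
    logits = [-sq_dist(vec, c) / temperature for c in codebook]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    mixed = [sum(w * c[k] for w, c in zip(weights, codebook)) for k in range(len(vec))]
    return mixed, weights

# Toy data: four 2D-grid tokens aligned onto two 1D slots.
tokens_2d = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
tokens_1d = [[0.0, 0.5], [1.0, 0.5]]
aligned, plan = align_2d_to_1d(tokens_2d, tokens_1d)

# Toy codebook with two entries; the input lies close to the first one.
codebook = [[0.0, 0.0], [1.0, 1.0]]
quantized, weights = soft_quantize([0.1, 0.0], codebook)
```

In this toy setup each 1D slot ends up close to the mean of its two nearest 2D tokens, and the soft-quantized vector stays differentiable with respect to the input because it is a weighted mix rather than a hard nearest-neighbor lookup. A real tokenizer would operate on learned embeddings and batched tensors.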

In practice

Use a 64-token budget: the reported state-of-the-art gFID of 1.56 on ImageNet-256 was obtained with 64 tokens.
Train the generator once for global tokens, then reuse across different local aligner scales.
Employ a two-stage training strategy for scaling local 2D aligners.
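
The "train the generator once, reuse across local aligner scales" pattern can be sketched as a pipeline in which the generator only ever produces global tokens and the local aligner is a swappable component. This is a minimal illustration under stated assumptions, not the authors' code: `Pipeline`, `toy_generator`, `make_toy_aligner`, and the token counts are hypothetical stand-ins, and the toy functions replace trained networks.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

Tokens = List[float]

@dataclass(frozen=True)
class Pipeline:
    """Pairs one (frozen) generator with one local aligner."""
    generator: Callable[[int], Tokens]   # seed -> global tokens
    aligner: Callable[[Tokens], Tokens]  # global tokens -> local tokens

    def sample(self, seed: int) -> Tokens:
        global_tokens = self.generator(seed)
        # Local tokens depend only on the global tokens, so the generator
        # never has to model them.
        local_tokens = self.aligner(global_tokens)
        return global_tokens + local_tokens

NUM_GLOBAL = 4  # toy value, not taken from the paper

def toy_generator(seed: int) -> Tokens:
    """Stand-in for a trained generator over global tokens."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(NUM_GLOBAL)]

def make_toy_aligner(num_local: int) -> Callable[[Tokens], Tokens]:
    """Stand-in for a local aligner of a given scale."""
    def aligner(global_tokens: Tokens) -> Tokens:
        mean = sum(global_tokens) / len(global_tokens)
        return [mean] * num_local  # placeholder prediction from global context
    return aligner

# One generator, two aligner scales: only the aligner is swapped.
small = Pipeline(toy_generator, make_toy_aligner(8))
large = Pipeline(toy_generator, make_toy_aligner(32))
sample_small = small.sample(seed=0)
sample_large = large.sample(seed=0)
```

Because both pipelines share the same generator, the global tokens for a given seed are identical, and only the number of predicted local tokens changes with the aligner scale.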

Topics

Image Tokenization
Generative AI
Self-Supervised Learning
Optimal Transport
Diffusion Models
Computational Efficiency

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.