AdaTok: Self-Budgeting Image Tokenization with Quality-Preserving Dynamic Tokens

· Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

AdaTok is a self-budgeting discrete 1D image tokenizer that addresses the inefficiency of encoding every image with a fixed token count. Where conventional tokenizers spend the same budget regardless of visual complexity, AdaTok sets its token allocation per image in a single pass. It has two components. Prioritized Representation Learning orders tokens using nested tail masking and uses Multi-Head LoRA decoder heads to handle budget-dependent semantic shifts. Adaptive Token Allocation trains a lightweight deterministic-group GRPO policy to select a budget, with Dynamic Pareto Weighting balancing fidelity against efficiency. On ImageNet-1K, AdaTok-Full achieves an rFID of 1.31 at 256 tokens, while AdaTok-Adaptive reaches an rFID of 1.50 with about 118 tokens on average, surpassing discrete 1D baselines. The adaptive variant yields about 2.1x throughput in autoregressive image generation relative to a fixed 256-token decode, showing that token count can be a learned, content-conditioned output.
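The summary does not spell out how nested tail masking works, so the following is only a minimal sketch of one plausible reading: during training a budget is drawn, the first `budget` tokens are kept, and the tail is replaced with a mask value, so every smaller budget is a prefix of every larger one. The names `tail_mask`, `sample_budget`, `MASK_ID`, and the candidate budget set are hypothetical and do not come from the paper.

```python
import random

MASK_ID = -1  # hypothetical placeholder id for a masked token slot


def tail_mask(tokens, budget, mask_id=MASK_ID):
    """Keep the first `budget` tokens and mask every position after them."""
    if not 0 < budget <= len(tokens):
        raise ValueError("budget must be in 1..len(tokens)")
    return tokens[:budget] + [mask_id] * (len(tokens) - budget)


def sample_budget(max_tokens, rng, candidates=(32, 64, 128, 256)):
    """Draw a training budget; the candidate set is illustrative only."""
    return rng.choice([c for c in candidates if c <= max_tokens])


# Toy 256-token sequence standing in for one image's 1D token ids.
tokens = list(range(256))
rng = random.Random(0)
budget = sample_budget(len(tokens), rng)
masked = tail_mask(tokens, budget)

# Nesting: the tokens kept at budget 64 are a prefix of those kept at 128.
small, large = tail_mask(tokens, 64), tail_mask(tokens, 128)
print(small[:64] == large[:64])
```

Under this reading, a decoder trained on such prefixes would see the earliest tokens at every budget, which is one way an ordering by priority could emerge.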

Key takeaway

For machine learning engineers optimizing image generation or compression pipelines, AdaTok offers a way to set token budgets per image rather than fixing them in advance. Because the token count is learned and conditioned on content, the budget tracks the visual complexity of each image. The reported result is about 2.1x throughput in autoregressive image generation relative to a fixed 256-token decode, with rFID rising from 1.31 at the full 256 tokens to 1.50 at roughly 118 tokens on average.
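As a back-of-envelope check on the throughput figure, the sketch below compares decode step counts, assuming (this is an assumption, not a claim from the paper) that autoregressive generation cost scales roughly linearly with the number of tokens produced. The function name `step_ratio` is illustrative.

```python
def step_ratio(fixed_tokens, mean_adaptive_tokens):
    """Ratio of decode steps for a fixed budget versus an adaptive average."""
    return fixed_tokens / mean_adaptive_tokens


# 256 fixed tokens versus about 118 tokens on average, as reported above.
ratio = step_ratio(256, 118)
print(f"{ratio:.2f}x fewer decode steps")  # prints "2.17x fewer decode steps"
```

The step-count ratio of about 2.17 is in line with the reported throughput of about 2.1x; any gap would plausibly come from costs that do not shrink with the token count, though the summary does not break this down.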

Key insights

Self-budgeting image tokenization dynamically adjusts token counts based on visual complexity, improving efficiency and fidelity.

Method

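AdaTok combines Prioritized Representation Learning, which orders tokens via nested tail masking, with Multi-Head LoRA decoder heads that handle budget-dependent semantic shifts. Adaptive Token Allocation then selects a budget using a GRPO policy, with Dynamic Pareto Weighting balancing fidelity and efficiency.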
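To make the budget-selection idea concrete, here is a minimal sketch of GRPO-style group-relative advantages computed over a fixed set of candidate budgets for one image. It assumes that "deterministic-group" means the group is a fixed list of budgets rather than sampled actions, and it uses a fixed trade-off weight in place of Dynamic Pareto Weighting, which the summary does not specify. The reward form, the `reward` and `group_advantages` names, and the distortion values are all made up for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def reward(distortion, n_tokens, max_tokens, weight):
    """Hypothetical reward: penalize distortion and the fraction of tokens used."""
    return -(1.0 - weight) * distortion - weight * (n_tokens / max_tokens)


# Fixed candidate budgets (assumed) and toy distortions for a single image.
budgets = [32, 64, 128, 256]
distortions = [0.40, 0.22, 0.12, 0.10]  # illustrative values, not measured data

rewards = [reward(d, b, 256, weight=0.3) for d, b in zip(distortions, budgets)]
advantages = group_advantages(rewards)

# The budget with the highest advantage is the one the policy would be
# pushed toward for this image under the assumed reward.
best_budget = budgets[max(range(len(budgets)), key=advantages.__getitem__)]
print(best_budget)
```

In this toy setup, raising `weight` favors smaller budgets and lowering it favors fidelity; a dynamic weighting scheme would adjust that trade-off during training rather than keep it constant.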

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.