(1D) Ordered Tokens Enable Efficient Test-Time Search
Summary
This study investigates how token structure influences the effectiveness of test-time search in autoregressive (AR) image generation. It hypothesizes that 1D ordered tokenizers with a coarse-to-fine structure, such as FlexTok (Bachmann et al., 2025), are more amenable to search than traditional 2D grid tokenizers: their intermediate states carry global semantic meaning that a verifier can evaluate reliably, so generation can be steered while it is still in progress. Experiments show that AR models trained on 1D ordered tokens exhibit better test-time scaling than grid-based counterparts. Pure test-time search over ordered token sequences, guided by an image-text verifier, can even achieve training-free text-to-image generation. The study systematically analyzes how classical search algorithms (best-of-N, beam search, lookahead search), different verifiers, and AR priors interact with each token structure, and it offers practical guidance on inference-time scaling for AR models.
Key takeaway
For Computer Vision Engineers building autoregressive image generation systems, the choice of tokenizer directly determines how much test-time search can help. 1D ordered tokenizers such as FlexTok scale better with inference compute than 2D grid tokenizers, and beam search yields superior gains on ordered tokens. This lets you trade inference compute for generation quality and text alignment, and even obtain training-free control, without extensive retraining.
Key insights
1D ordered tokens with coarse-to-fine structure significantly enhance test-time search in autoregressive image generation.
Principles
- Token structure dictates search amenability.
- Coarse-to-fine ordering enables semantic verification.
- Search can compensate for smaller model sizes.
Method
The Search-over-Tokens (SoTo) framework systematically evaluates test-time scaling by combining AR generation with search algorithms (best-of-N, beam, lookahead), diverse verifiers, and varying AR priors (conditional, unconditional, uniform).
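To make the two simplest search strategies concrete, here is a minimal sketch of best-of-N and beam search over token sequences. It is a toy illustration, not the SoTo implementation: the names `uniform_prior`, `decode`, `make_verifier`, `best_of_n`, and `beam_search` are hypothetical, the prior is a uniform sampler over 8 tokens, and the "verifier" simply measures how close a decoded number is to a target. The decoding is deliberately coarse-to-fine (earlier tokens contribute more), so that partial sequences can be scored meaningfully, which is the property the summary attributes to 1D ordered tokens.

```python
import random
from typing import Callable, List, Sequence

VOCAB = 8  # toy vocabulary size (assumption, for illustration only)

Prior = Callable[[Sequence[int], random.Random], int]  # samples the next token
Verifier = Callable[[Sequence[int]], float]            # scores a (partial) sequence


def uniform_prior(prefix: Sequence[int], rng: random.Random) -> int:
    """Toy stand-in for an AR model: ignores the prefix, samples uniformly."""
    return rng.randrange(VOCAB)


def decode(tokens: Sequence[int]) -> float:
    """Toy coarse-to-fine decoder: token i refines the value at scale VOCAB**-(i+1)."""
    return sum(tok / VOCAB ** (i + 1) for i, tok in enumerate(tokens))


def make_verifier(target: float) -> Verifier:
    """Toy verifier: higher score means the decoded value is closer to the target."""
    def verifier(tokens: Sequence[int]) -> float:
        return -abs(decode(tokens) - target)
    return verifier


def best_of_n(prior: Prior, verifier: Verifier, length: int, n: int,
              rng: random.Random) -> List[int]:
    """Sample n complete sequences and keep the one the verifier scores highest."""
    candidates = []
    for _ in range(n):
        seq: List[int] = []
        for _ in range(length):
            seq.append(prior(seq, rng))
        candidates.append(seq)
    return max(candidates, key=verifier)


def beam_search(prior: Prior, verifier: Verifier, length: int, beam_width: int,
                expansions: int, rng: random.Random) -> List[int]:
    """At each step, expand every beam and keep the beam_width best prefixes."""
    beams: List[List[int]] = [[]]
    for _ in range(length):
        children = [beam + [prior(beam, rng)]
                    for beam in beams for _ in range(expansions)]
        children.sort(key=verifier, reverse=True)  # verifier scores partial sequences
        beams = children[:beam_width]
    return beams[0]


if __name__ == "__main__":
    verifier = make_verifier(target=0.3141)
    bon = best_of_n(uniform_prior, verifier, length=6, n=16, rng=random.Random(0))
    beam = beam_search(uniform_prior, verifier, length=6, beam_width=4,
                       expansions=4, rng=random.Random(0))
    print("best-of-N:", bon, verifier(bon))
    print("beam     :", beam, verifier(beam))
```

Beam search only helps when the verifier's score on a prefix is informative about the final result. In this toy that holds by construction; with a real tokenizer it depends on the token structure, which is the question the study examines.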
In practice
- Prioritize 1D ordered tokenizers for search-guided generation.
- Use beam search with 1D ordered tokens for efficient search.
- Employ ensemble verifiers for robust guidance.
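As an illustration of the last point, one simple way to ensemble verifiers is to normalize each verifier's scores across the candidate set and average them, so that no single verifier dominates because of its score scale. The summary does not say how the study combines verifiers, so treat this as an assumed scheme; `zscores`, `ensemble_scores`, and `pick_best` are hypothetical helper names.

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def zscores(values: Sequence[float]) -> List[float]:
    """Standardize scores; a verifier that cannot separate candidates contributes 0."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def ensemble_scores(candidates: Sequence[T],
                    verifiers: Sequence[Callable[[T], float]]) -> List[float]:
    """Average each candidate's z-scored score over all verifiers."""
    per_verifier = [zscores([verify(c) for c in candidates]) for verify in verifiers]
    return [mean(column) for column in zip(*per_verifier)]


def pick_best(candidates: Sequence[T],
              verifiers: Sequence[Callable[[T], float]]) -> T:
    """Return the candidate with the highest ensemble score."""
    scores = ensemble_scores(candidates, verifiers)
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index]
```

In a search loop, `ensemble_scores` would take the place of a single verifier when ranking candidates or beams. Because the normalization is relative to the current candidate set, the scores are only comparable within one ranking step.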
Topics
- 1D Ordered Tokenizers
- Test-Time Search
- Autoregressive Image Generation
- Coarse-to-Fine Token Structure
- Verifier-Guided Generation
Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.