(1D) Ordered Tokens Enable Efficient Test-Time Search

Source: cs.AI updates on arXiv.org · Field: Technology & Digital, Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

A study investigates how token structure influences the effectiveness of test-time search in autoregressive (AR) image generation. It hypothesizes that 1D ordered tokenizers with a coarse-to-fine structure, such as FlexTok (Bachmann et al., 2025), are more amenable to search than traditional 2D grid tokenizers, because their intermediate states carry global semantic meaning that a verifier can reliably evaluate, which lets search steer generation while it is still in progress. Experiments show that AR models trained on 1D ordered tokens exhibit improved test-time scaling compared to grid-based counterparts. Furthermore, pure test-time search over ordered token sequences, guided by an image-text verifier, can achieve training-free text-to-image generation. The study systematically analyzes how classical search algorithms (best-of-N, beam search, lookahead search), different verifiers, and AR priors interact with various token structures, highlighting the impact of token structure on inference-time scalability and offering practical guidance for test-time scaling in AR models.

Key takeaway

For computer vision engineers building autoregressive image generation systems, the choice of tokenizer directly affects how well test-time search works. According to the study, AR models trained on 1D ordered tokens such as FlexTok scale better at test time than grid-based counterparts, with beam search yielding particularly strong gains on ordered tokens. Because a verifier can score partial ordered sequences reliably, you can trade inference compute for generation quality and prompt alignment, and even obtain training-free text-to-image control, without extensive retraining.
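The compute-for-quality trade mentioned above can be illustrated with best-of-N, the simplest of the search algorithms the study considers. The sketch below is a toy, not the paper's implementation: `sample_sequence` stands in for an AR model (here it just draws tokens uniformly), `toy_verifier` stands in for an image-text verifier, and `VOCAB_SIZE`, `SEQ_LEN`, and `TARGET` are hypothetical values chosen only for illustration.

```python
import random
from typing import Callable, Tuple

Tokens = Tuple[int, ...]

# All three constants are hypothetical toy values, not taken from the study.
VOCAB_SIZE = 16
SEQ_LEN = 8
TARGET: Tokens = (3, 1, 4, 1, 5, 9, 2, 6)  # stands in for "what the prompt asks for"


def sample_sequence(rng: random.Random) -> Tokens:
    """Toy stand-in for an AR model: draw every token uniformly at random."""
    return tuple(rng.randrange(VOCAB_SIZE) for _ in range(SEQ_LEN))


def toy_verifier(tokens: Tokens) -> float:
    """Toy stand-in for an image-text verifier: fraction of positions matching TARGET."""
    return sum(a == b for a, b in zip(tokens, TARGET)) / SEQ_LEN


def best_of_n(
    n: int,
    rng: random.Random,
    verifier: Callable[[Tokens], float] = toy_verifier,
) -> Tokens:
    """Sample n complete sequences and keep the one the verifier scores highest."""
    candidates = [sample_sequence(rng) for _ in range(n)]
    return max(candidates, key=verifier)


if __name__ == "__main__":
    # Re-seeding makes each larger run a superset of the smaller one,
    # so the best score can only stay the same or improve as n grows.
    for n in (1, 4, 16, 64):
        best = best_of_n(n, random.Random(0))
        print(n, toy_verifier(best))
```

Best-of-N only scores finished sequences, so it does not depend on intermediate states being meaningful. Beam and lookahead search do, which is where the study argues that token structure matters.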

Key insights

1D ordered tokens with coarse-to-fine structure significantly enhance test-time search in autoregressive image generation.

Method

The Search-over-Tokens (SoTo) framework systematically evaluates test-time scaling by combining AR generation with search algorithms (best-of-N, beam, lookahead), diverse verifiers, and varying AR priors (conditional, unconditional, uniform).
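As a rough illustration of how these pieces fit together, the sketch below runs beam search over token prefixes, with an optional lookahead that rolls each prefix forward a few tokens before scoring it. It is a minimal sketch under stated assumptions, not the SoTo code: the function names, the `Proposer` and `Verifier` signatures, and the toy verifier are all hypothetical, the proposer shown is a uniform prior, and lookahead is implemented in one common way (score a short rollout instead of the bare prefix).

```python
import random
from typing import Callable, List, Tuple

Tokens = Tuple[int, ...]
Proposer = Callable[[Tokens, random.Random], int]  # prefix -> next token (the "AR prior")
Verifier = Callable[[Tokens], float]               # partial or full sequence -> score


def rollout(prefix: Tokens, propose: Proposer, steps: int, rng: random.Random) -> Tokens:
    """Extend a prefix by `steps` tokens drawn from the proposer."""
    for _ in range(steps):
        prefix = prefix + (propose(prefix, rng),)
    return prefix


def beam_search(
    propose: Proposer,
    verify: Verifier,
    length: int,
    beam_width: int,
    expansions: int,
    rng: random.Random,
    lookahead: int = 0,
) -> Tokens:
    """Keep the `beam_width` best prefixes at every step, as ranked by the verifier."""
    beams: List[Tokens] = [()]
    for step in range(length):
        # Propose `expansions` one-token continuations per surviving prefix.
        children = {
            prefix + (propose(prefix, rng),)
            for prefix in beams
            for _ in range(expansions)
        }
        # With lookahead > 0, score a short rollout rather than the bare prefix.
        k = min(lookahead, length - (step + 1))
        scored = [(verify(rollout(seq, propose, k, rng)), seq) for seq in sorted(children)]
        scored.sort(key=lambda item: (-item[0], item[1]))  # best score first, ties by tokens
        beams = [seq for _, seq in scored[:beam_width]]
    return beams[0]


def make_toy_verifier(target: Tokens) -> Verifier:
    """Hypothetical verifier: count positions where the sequence agrees with `target`."""
    def verify(seq: Tokens) -> float:
        return float(sum(a == b for a, b in zip(seq, target)))
    return verify


def uniform_proposer(vocab_size: int) -> Proposer:
    """Uniform prior: every token is equally likely regardless of the prefix."""
    def propose(prefix: Tokens, rng: random.Random) -> int:
        return rng.randrange(vocab_size)
    return propose


if __name__ == "__main__":
    target = (2, 0, 3, 1, 2, 0)
    verify = make_toy_verifier(target)
    found = beam_search(uniform_proposer(4), verify, length=len(target),
                        beam_width=2, expansions=4, rng=random.Random(0), lookahead=1)
    print(found, verify(found))
```

The toy verifier rewards every correct prefix token, so pruning on partial scores is informative here. That property is what the study attributes to coarse-to-fine ordered tokens, and what a verifier may lack on a partially filled 2D grid.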

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.