How Many Visual Tokens Do Multimodal Language Models Need? Scaling Visual Token Pruning with F^3A
Summary
The Fruit-Fly-Foraging Algorithm (F3A) is a training-free router that prunes visual tokens in multimodal language models (MLLMs) before the language backbone consumes them. It targets the high inference cost of increasingly long visual token sequences by treating pruning as a task-conditioned evidence search. F3A operates in three stages: coarse evidence localization, local refinement, and recovery of under-covered regions, guided by lightweight question-conditioned cues and frozen sparse sensing heads. Evaluated on Qwen3-VL models (2B to 235B parameters), F3A consistently outperforms existing training-free pruning methods such as FastV, VisionZip, DivPrune, and CDPruner across 11 multimodal benchmarks and three retention ratios (60%, 40%, 20%). At 20% visual token retention, F3A retains 93.86% of full-token performance on average. To reach 97% of full-token performance, it needs only 39.9% of the visual tokens, compared to 50.1% for the strongest baseline.
Key takeaway
For AI engineers optimizing multimodal LLM inference, F3A is a training-free option for visual token pruning. By reframing pruning as a task-conditioned evidence search, it achieves higher accuracy than the compared baselines at aggressive compression ratios and needs fewer tokens to stay near full-token performance. Consider evaluating F3A in your MLLM deployment pipeline to reduce compute cost and latency with limited accuracy loss (93.86% of full-token performance on average at 20% retention), especially for large-scale models like Qwen3-VL-235B-A22B.
Key insights
Visual token pruning in MLLMs is best approached as task-conditioned evidence search, not static ranking.
Principles
- Performance depends on how the visual token budget is allocated, not only on how large it is.
- Token value depends on the query and other selected tokens, not just individual salience.
- Combining coarse exploration with local refinement and recovery improves pruning.
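The second principle can be illustrated with a toy comparison. The sketch below is not F3A: it contrasts static top-k ranking with a generic greedy rule (relevance minus redundancy, similar in spirit to maximal marginal relevance). The function names, the `penalty` parameter, and the toy features are all assumptions made for illustration.

```python
import numpy as np

def top_k_by_salience(relevance, k):
    """Static ranking: keep the k tokens with the highest individual score."""
    order = np.argsort(-relevance, kind="stable")
    return sorted(order[:k].tolist())

def greedy_conditional(features, relevance, k, penalty=1.0):
    """Greedy pick: relevance minus similarity to tokens already selected.

    Illustrative rule only (hypothetical), not the F3A router.
    """
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = feats @ feats.T                      # cosine similarity between tokens
    taken = np.zeros(len(relevance), dtype=bool)
    max_sim = np.zeros(len(relevance))         # similarity to the closest kept token
    for _ in range(k):
        gain = relevance - penalty * max_sim   # value now depends on what is kept
        gain[taken] = -np.inf                  # never pick the same token twice
        best = int(np.argmax(gain))
        taken[best] = True
        max_sim = np.maximum(max_sim, sim[best])
    return sorted(np.flatnonzero(taken).tolist())

# Toy data: tokens 0 and 1 are near-duplicates, token 2 is distinct.
features = np.array([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]])
relevance = np.array([0.9, 0.85, 0.6])

static_pick = top_k_by_salience(relevance, k=2)
conditional_pick = greedy_conditional(features, relevance, k=2)
```

With these toy values, static ranking spends both slots on the two near-duplicate tokens, while the greedy rule keeps one of them plus the distinct token.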
Method
F3A uses prompt-conditioned cues and sparse sensing heads to guide a three-stage foraging process: coarse search, visual lock-on, and rescue jump, allocating a fixed visual token budget before LLM prefill.
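A minimal sketch of a three-stage, fixed-budget selection in the spirit of the description above. This summary does not give the paper's scoring functions or stage rules, so everything here is an assumption: `score` stands in for a question-conditioned relevance map over the patch grid, and `cell`, `coarse_frac`, and `refine_frac` are hypothetical knobs. It should be read as an illustration of the coarse / refine / rescue idea, not as the authors' implementation.

```python
import numpy as np

def dilate(mask):
    """4-neighbour dilation of a boolean grid."""
    out = mask.copy()
    out[1:, :] |= mask[:-1, :]
    out[:-1, :] |= mask[1:, :]
    out[:, 1:] |= mask[:, :-1]
    out[:, :-1] |= mask[:, 1:]
    return out

def forage_tokens(score, budget, cell=4, coarse_frac=0.25, refine_frac=0.5):
    """Pick exactly `budget` grid positions from a 2D relevance map (hypothetical)."""
    h, w = score.shape
    assert 0 < budget <= h * w
    keep = np.zeros((h, w), dtype=bool)
    rows, cols = np.indices((h, w))
    cells_per_row = -(-w // cell)                      # ceil division
    cell_id = (rows // cell) * cells_per_row + (cols // cell)
    n_cells = int(cell_id.max()) + 1
    cell_mean = np.array([score[cell_id == c].mean() for c in range(n_cells)])

    def best_in_cell(c):
        masked = np.where((cell_id == c) & ~keep, score, -np.inf)
        idx = np.unravel_index(np.argmax(masked), masked.shape)
        return idx if np.isfinite(masked[idx]) else None

    # Stage 1, coarse search: seed the best token of the most promising cells.
    n_coarse = max(1, int(budget * coarse_frac))
    for c in np.argsort(-cell_mean)[:n_coarse]:
        idx = best_in_cell(c)
        if idx is not None:
            keep[idx] = True

    # Stage 2, local refinement: grow around kept tokens, best neighbour first.
    for _ in range(int(budget * refine_frac)):
        frontier = dilate(keep) & ~keep
        if keep.sum() >= budget or not frontier.any():
            break
        masked = np.where(frontier, score, -np.inf)
        keep[np.unravel_index(np.argmax(masked), masked.shape)] = True

    # Stage 3, rescue: give uncovered cells one token each while budget remains.
    for c in np.argsort(-cell_mean):
        if keep.sum() >= budget:
            break
        if not keep[cell_id == c].any():
            keep[best_in_cell(c)] = True

    # Spend any leftover budget on the highest-scoring remaining tokens.
    short = budget - int(keep.sum())
    if short > 0:
        masked = np.where(keep, -np.inf, score).ravel()
        keep.flat[np.argsort(-masked)[:short]] = True
    return np.flatnonzero(keep)                        # flat indices, row-major

rng = np.random.default_rng(0)
kept = forage_tokens(rng.random((8, 8)), budget=16)
```

The selection runs once on the visual tokens and returns a fixed-size index set, which matches the summary's point that the budget is allocated before LLM prefill. How the real router scores tokens and splits the budget across stages is not specified here.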
In practice
- F3A speeds up end-to-end inference by up to 1.29x on Qwen3-VL-8B.
- It lowers KV-cache footprint from 117.7 MB to 29.0 MB at 20% retention.
- The method is training-free and preserves the original MLLM pipeline.
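The KV-cache figures above can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes a standard transformer KV cache that grows linearly with the number of cached tokens and that only visual tokens are pruned. The layer count, head count, and head size are illustrative placeholders, not values taken from the source.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Keys plus values (factor 2) for every layer, KV head, and cached token."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens

def implied_visual_share(full_mb, pruned_mb, retention):
    """Solve pruned/full = (1 - v) + retention * v for the visual-token share v.

    Assumes cache size is proportional to token count and text tokens are untouched.
    """
    return (1 - pruned_mb / full_mb) / (1 - retention)

# Illustrative config (assumed, not from the source): 36 layers, 8 KV heads,
# head_dim 128, 16-bit values.
full = kv_cache_bytes(1000, num_layers=36, num_kv_heads=8, head_dim=128)
pruned = kv_cache_bytes(200, num_layers=36, num_kv_heads=8, head_dim=128)

# Reported figures: 117.7 MB at full tokens, 29.0 MB at 20% visual retention.
share = implied_visual_share(117.7, 29.0, retention=0.2)
```

Under these assumptions, the reported drop implies that roughly 94% of the cached tokens in the measured setting were visual tokens, which is why pruning them shrinks the cache to about a quarter of its original size.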
Topics
- Fruit-Fly-Foraging Algorithm (F3A)
- Visual Token Pruning
- Multimodal Language Models
- Qwen3-VL
- Training-Free Pruning
Best for: Research Scientist, AI Engineer, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.