PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding

2026-05-28 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision, Natural Language Processing · Depth: Expert, quick

Summary

PARCEL (Pool-Anchored Resampling with Conditioned Elastic Queries) is a visual tokenization architecture that targets the quadratic computational bottleneck that dense visual token sequences create in Large Vision-Language Models (LVLMs). Existing elastic visual-token compression methods compress either spatially or through learned queries, and both degrade under aggressive compression: spatial-only compression suffers spectral aliasing, while query-only compression loses spatial grounding. PARCEL instead dynamically partitions feature extraction between two token types. Spatial pool tokens serve as low-frequency layout anchors, and elastic query tokens are conditioned on those anchors through Pool-Conditioned Query Resampling, which encourages the queries to focus on complementary visual features. Evaluated across 27 benchmarks, PARCEL consistently outperforms existing matryoshka baselines and improves the performance-efficiency Pareto frontier while keeping a "train once, deploy anywhere" paradigm.
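
To make the quadratic bottleneck concrete, the short sketch below counts the query-key pairs a single full self-attention layer scores at several visual-token budgets. The budgets (576, 144, 36, 9) and the function name are illustrative assumptions, not figures from the paper.

```python
def attention_pair_count(num_tokens: int) -> int:
    """Query-key pairs scored by one full (non-causal) self-attention layer."""
    return num_tokens * num_tokens


# Hypothetical budgets: a 24x24 patch grid (576 tokens) and smaller token counts.
FULL_BUDGET = 576
BUDGETS = (576, 144, 36, 9)

# Factor by which the pair count shrinks relative to the full budget.
reduction = {
    budget: attention_pair_count(FULL_BUDGET) // attention_pair_count(budget)
    for budget in BUDGETS
}

for budget, factor in reduction.items():
    print(f"{budget:>3} visual tokens -> {factor}x fewer attention pairs")
```

Cutting the token count by 4x cuts the pair count by 16x, which is why aggressive compression is attractive. In a real LVLM the sequence also contains text tokens, so end-to-end savings are smaller than this visual-only count suggests.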

Key takeaway

For machine learning engineers optimizing LVLM inference, PARCEL offers a better performance-efficiency trade-off across visual-token budgets than existing matryoshka-style compression methods. A single trained model can adapt to different computational constraints while preserving fine-grained visual detail and spatial grounding, so there is no need to train and maintain a separate model for each budget.

Key insights

PARCEL dynamically partitions visual feature extraction using anchored resampling to overcome LVLM compression limitations.

Principles

Spatial-only compression causes spectral aliasing.
Query-only compression degrades spatial grounding.
Anchor tokens can guide elastic queries.

Method

PARCEL establishes spatial pool tokens as low-frequency layout anchors. It then conditions elastic query tokens on these anchors via Pool-Conditioned Query Resampling, directing queries to complementary visual features.
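
The two steps above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the summary does not say how the conditioning is computed, so the residual cross-attention over the anchors, the projection-free attention, the prefix-based elastic query selection, and every name here (`pool_tokens`, `cross_attend`, `parcel_like_tokens`) are assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries, keys_values):
    """Single-head scaled dot-product cross-attention (no learned projections)."""
    dim = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(dim)
    return softmax(scores, axis=-1) @ keys_values


def pool_tokens(feature_grid, pool_size):
    """Average-pool an (H, W, D) patch grid into (H/p * W/p, D) anchor tokens."""
    h, w, d = feature_grid.shape
    assert h % pool_size == 0 and w % pool_size == 0
    blocks = feature_grid.reshape(h // pool_size, pool_size, w // pool_size, pool_size, d)
    return blocks.mean(axis=(1, 3)).reshape(-1, d)


def parcel_like_tokens(feature_grid, query_bank, pool_size, num_queries):
    """Return anchor tokens followed by anchor-conditioned query tokens."""
    d = feature_grid.shape[-1]
    # Step 1: low-frequency layout anchors from spatial pooling.
    anchors = pool_tokens(feature_grid, pool_size)
    # Elastic budget (assumption): use a prefix of a shared query bank.
    queries = query_bank[:num_queries]
    # Step 2 (assumption): condition each query on the anchors via a residual read.
    conditioned = queries + cross_attend(queries, anchors)
    # Conditioned queries then resample from the dense patch features.
    resampled = cross_attend(conditioned, feature_grid.reshape(-1, d))
    return np.concatenate([anchors, resampled], axis=0)


rng = np.random.default_rng(0)
grid = rng.normal(size=(8, 8, 16))   # hypothetical 8x8 patch grid, 16-dim features
bank = rng.normal(size=(32, 16))     # hypothetical bank of 32 query vectors
tokens = parcel_like_tokens(grid, bank, pool_size=4, num_queries=6)
print(tokens.shape)  # (10, 16): 4 anchors + 6 query tokens
```

Because each query is processed independently in this sketch, shrinking `num_queries` yields a prefix of the larger output, which is one simple way a single set of weights could serve several token budgets.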

In practice

Improve LVLM efficiency with PARCEL.
Deploy a single model for multiple budgets.
Enhance fine-grained detail in compressed vision.

Topics

Vision-Language Models
Visual Tokenization
Model Compression
Efficient Inference
PARCEL
Performance-Efficiency

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.