Teaching Multimodal LLMs to Actually See: Perception Programs (P²)
Summary
A new CVPR 2026 paper from Huawei (Canada) and the University of Alberta introduces Perception Programs (P²), a model-agnostic, training-free approach to bridging the representation gap between specialist vision tools and multimodal LLMs. The core issue is that vision tools produce dense, pixel-level outputs (e.g., depth maps) that LLMs struggle to parse for symbolic reasoning. P² inserts a translation layer that reorganizes raw output from depth, flow, and matching tools into standardized coordinates and numerical values, presented as YAML-style text, so the LLM can reason over structured visual information rather than raw pixels. Evaluated on six perception subtasks of the BLINK benchmark, P² achieved roughly 19–20 absolute points of average gain with GPT-5 Mini, Gemini-2.5 Pro, InternVL-3.5, and Qwen3-VL, setting a new state of the art. P² has two stated limitations: errors from the underlying tool propagate unchecked into the LLM's input, and the approach applies only to tools that produce structured output.
Key takeaway
Machine learning engineers building multimodal LLM systems for fine-grained visual perception should consider integrating Perception Programs (P²). The approach is training-free and model-agnostic: it translates dense vision tool outputs into structured text that the LLM can reason over. The reported gains are substantial, roughly 19–20 absolute points on average on the BLINK benchmark, and they apply to tasks where the tool produces structured output.
Key insights
Perception Programs (P²) translate dense visual tool outputs into structured text, enabling multimodal LLMs to reason effectively over fine-grained visual information.
Principles
- Bridge the pixel-to-symbolic representation gap.
- LLMs reason best with structured, textual input.
- Translate dense visual data into symbolic forms.
Method
P² reorganizes dense vision tool outputs (depth, flow, matching) into standardized coordinates and numerical values, using a set of primitives and relational rules. The structured result is formatted as YAML-style text and passed to the LLM as input. No model training is involved.
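As a rough illustration of the translation step described above, the sketch below turns a dense depth map into YAML-style text for a handful of query points. It is not the paper's implementation: the function name `depth_to_perception_program`, the field names (`xy_norm`, `closest_to_camera`), the patch averaging, and the convention that larger values mean farther away are all assumptions made for this example.

```python
import numpy as np

def depth_to_perception_program(depth, points, patch=1):
    """Summarise a dense depth map as YAML-style text for a few query points.

    depth  : 2D array of per-pixel depth (assumed: larger value = farther).
    points : dict mapping a label to (x, y) pixel coordinates.
    patch  : half-width of the window averaged around each point.
    The output schema is hypothetical, chosen only for illustration.
    """
    h, w = depth.shape
    lines = ["tool: depth", f"image_size: [{w}, {h}]", "points:"]
    values = {}
    for label, (x, y) in points.items():
        # Average a small window so a single noisy pixel does not dominate.
        y0, y1 = max(0, y - patch), min(h, y + patch + 1)
        x0, x1 = max(0, x - patch), min(w, x + patch + 1)
        d = float(depth[y0:y1, x0:x1].mean())
        values[label] = d
        lines.append(f"  - label: {label}")
        lines.append(f"    xy_norm: [{x / w:.3f}, {y / h:.3f}]")
        lines.append(f"    depth: {d:.3f}")
    # Relational rule: state the ordering explicitly rather than leaving
    # the LLM to infer it from raw numbers.
    closest = min(values, key=values.get)
    lines.append("relations:")
    lines.append(f"  closest_to_camera: {closest}")
    return "\n".join(lines)

# Synthetic depth map whose values grow from left to right.
depth = np.tile(np.linspace(1.0, 5.0, 8), (8, 1))
program = depth_to_perception_program(depth, {"A": (1, 4), "B": (6, 4)})
print(program)
```

The point of the sketch is the shape of the output: a few labelled coordinates, a few numbers, and an explicit relation, in place of tens of thousands of pixel values.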
In practice
- Enhance LLM performance on fine-grained visual tasks.
- Integrate P² with existing depth, flow, or matching tools.
- Expect gains on the order of the reported 19–20 absolute points on BLINK; results on other tasks may differ.
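The integration idea in the list above could look something like the following for an optical-flow tool. This is a minimal sketch under stated assumptions: the flow field is taken to be an `(H, W, 2)` array of per-pixel displacements, and `flow_to_perception_program`, `build_prompt`, the grid summary, and the prompt wording are hypothetical choices for this example rather than anything specified by the paper.

```python
import numpy as np

def flow_to_perception_program(flow, grid=2):
    """Summarise a dense optical-flow field (H, W, 2) as YAML-style text.

    The image is split into grid x grid cells and each cell reports its
    mean displacement in pixels. The schema is illustrative only.
    """
    h, w, _ = flow.shape
    lines = ["tool: optical_flow", f"grid: [{grid}, {grid}]", "cells:"]
    for r in range(grid):
        for c in range(grid):
            cell = flow[r * h // grid:(r + 1) * h // grid,
                        c * w // grid:(c + 1) * w // grid]
            dx, dy = cell.reshape(-1, 2).mean(axis=0)
            lines.append(f"  - cell: [{r}, {c}]")
            lines.append(f"    mean_dx: {dx:.2f}")
            lines.append(f"    mean_dy: {dy:.2f}")
    return "\n".join(lines)

def build_prompt(question, program):
    """Place the perception program ahead of the question in a text prompt."""
    return (
        "Structured output from a vision tool:\n"
        f"{program}\n\n"
        f"Question: {question}\n"
        "Answer using the values above together with the image."
    )

# Synthetic flow: the right half of the frame moves 3 px to the right.
flow = np.zeros((4, 4, 2))
flow[:, 2:, 0] = 3.0
prompt = build_prompt("Which side of the frame is moving?",
                      flow_to_perception_program(flow))
print(prompt)
```

Because the structured text is produced outside the model, the same prompt can be sent to any multimodal LLM alongside the image, which is what makes the approach model-agnostic. It also means any error in the tool's output reaches the model unchecked, as noted in the summary.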
Topics
- Multimodal LLMs
- Perception Programs (P²)
- Visual Reasoning
- Representation Gap
- BLINK Benchmark
- Vision Tool Integration
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by LLM on Medium.