Teaching Multimodal LLMs to Actually See: Perception Programs (P²)

2026-06-22 · Source: LLM on Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, quick

Summary

A CVPR 2026 paper from Huawei (Canada) and the University of Alberta introduces Perception Programs (P²), a model-agnostic, training-free approach to bridging the representation gap between specialist vision tools and multimodal LLMs. The core issue is that vision tools produce dense, pixel-level outputs (e.g., depth maps) that LLMs struggle to parse for symbolic reasoning. P² inserts a translation layer that reorganizes raw outputs from depth, flow, and matching tools into standardized coordinates and numerical values, presented as YAML-style text, so the LLM can reason over structured visual information. On six perception subtasks of the BLINK benchmark, P² yielded roughly 19–20 absolute points of average gain with GPT-5 Mini, Gemini-2.5 Pro, InternVL-3.5, and Qwen3-VL, setting a new state of the art. Its limitations include unchecked error propagation and applicability only to tools that produce structured output.

Key takeaway

Machine learning engineers building multimodal LLM systems for fine-grained visual perception should consider integrating Perception Programs (P²). The approach is training-free and model-agnostic: it translates dense vision tool outputs into structured text that the LLM can reason over. The reported gains are substantial, roughly 19–20 absolute points on average on the BLINK benchmark, particularly on tasks where the tools produce structured output.

Key insights

Perception Programs (P²) translate dense visual tool outputs into structured text, enabling multimodal LLMs to reason effectively over fine-grained visual information.

Principles

Bridge the pixel-to-symbolic representation gap.
LLMs reason best with structured, textual input.
Translate dense visual data into symbolic forms.

Method

P² reorganizes dense vision tool outputs (depth, flow, and matching) using primitives and relational rules into standardized coordinates and numerical values. This structured data is then formatted as YAML-style text and passed to the LLM as input; no model training is required.
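
To make the translation step concrete, here is a minimal Python sketch of turning a dense depth map into YAML-style text. It is an illustration under stated assumptions, not the paper's implementation: the function name `depth_to_perception_text`, the field names (`xy_norm`, `near_to_far`), and the convention that larger depth values mean farther away are all hypothetical choices made for this example.

```python
def depth_to_perception_text(depth, points):
    """Summarize a dense depth map as YAML-style text for an LLM prompt.

    depth:  2D list of floats, depth[y][x]; larger means farther (assumed).
    points: dict mapping a label to an (x, y) pixel coordinate.
    """
    height, width = len(depth), len(depth[0])
    lines = ["tool: depth", f"image_size: [{width}, {height}]", "points:"]
    for label, (x, y) in points.items():
        lines.append(f"  - label: {label}")
        # Normalize pixel coordinates so the text does not depend on resolution.
        lines.append(f"    xy_norm: [{x / width:.2f}, {y / height:.2f}]")
        lines.append(f"    depth: {depth[y][x]:.2f}")
    # Relational rule: state the near-to-far ordering explicitly.
    ordered = sorted(points, key=lambda k: depth[points[k][1]][points[k][0]])
    lines.append("relations:")
    lines.append(f"  near_to_far: [{', '.join(ordered)}]")
    return "\n".join(lines)


# Toy 4x2 depth map standing in for a real depth estimator's output.
toy_depth = [[1.0, 2.0, 3.0, 4.0],
             [1.5, 2.5, 3.5, 4.5]]
print(depth_to_perception_text(toy_depth, {"A": (0, 0), "B": (3, 1)}))
```

The point of the sketch is the shape of the output: a handful of named coordinates, values, and an explicit relation, rather than thousands of raw pixel values.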

In practice

Enhance LLM performance on fine-grained visual tasks.
Integrate P² with existing depth or flow tools.
Expect average gains of roughly 19–20 absolute points on BLINK, per the reported results.
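
Because the structured output is plain text, integration can be as simple as placing it in the prompt. The sketch below shows one way to do that; the function name `build_prompt` and the prompt wording are assumptions made for illustration and are not taken from the paper, and no particular LLM API is implied.

```python
def build_prompt(question, perception_text):
    """Combine YAML-style tool output with a question into one text prompt.

    The wording here is a hypothetical template; any multimodal LLM that
    accepts text alongside the image could consume the result unchanged.
    """
    return (
        "You are given structured measurements extracted from the image "
        "by a vision tool.\n"
        "Use them, together with the image, to answer the question.\n\n"
        "Perception program:\n"
        f"{perception_text}\n\n"
        f"Question: {question}\n"
        "Answer with the label only."
    )


# Example with a hand-written perception program for two labeled points.
example_text = "tool: depth\nrelations:\n  near_to_far: [A, B]"
print(build_prompt("Which point is closer to the camera, A or B?", example_text))
```

Since nothing in this step touches model weights, the same prompt can be sent to different models, which is what makes the approach model-agnostic.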

Topics

Multimodal LLMs
Perception Programs (P²)
Visual Reasoning
Representation Gap
BLINK Benchmark
Vision Tool Integration

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by LLM on Medium.