From Prompts to Tokens: Internalizing Causal Supervision in Vision-Language Model for Multi-Image Causal Reasoning

Source: Artificial Intelligence · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

BridgeVLM is a vision-language model (VLM) that internalizes visual causal reasoning, targeting the brittleness of existing VLMs on interventional and counterfactual queries over multi-image inputs. Rather than relying on external textual prompts, BridgeVLM induces a causal graph from the visual inputs and converts it into structured Causal Tokens. RAMP layers integrated into the LLM decoder then process these tokens, performing causal message passing. A unified training interface, M3S, provides fine-grained causal supervision at both local and global levels. On CausalVLBench, BridgeVLM reaches 54.4% accuracy on intervention tasks, up from 33.2% with prompt-level supervision (+21.2 points), and raises the causal structure learning $F_1$ score from 33.4% to 75.1% (+41.7 points). On Causal3D, results improve from 43.6% to 49.0% (+5.4 points).
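
To put the reported numbers on a common footing, the short sketch below computes absolute gains (in percentage points) and relative gains for the three before/after pairs quoted in the summary. The dictionary labels and the `gains` helper are illustrative names, not taken from the original article; only the six percentages come from the source.

```python
# Reported scores in percent, as (before, after) pairs quoted in the summary.
# The labels are illustrative; only the numbers come from the source.
reported = {
    "CausalVLBench intervention accuracy": (33.2, 54.4),
    "Causal3D": (43.6, 49.0),
    "CausalVLBench structure-learning F1": (33.4, 75.1),
}


def gains(before, after):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return round(absolute, 1), round(relative, 1)


summary = {name: gains(*scores) for name, scores in reported.items()}

for name, (abs_points, rel_percent) in summary.items():
    print(f"{name}: +{abs_points} points ({rel_percent}% relative)")
```

The absolute gains are 21.2, 5.4, and 41.7 points respectively. In relative terms the structure-learning $F_1$ more than doubles, while the Causal3D improvement is the most modest of the three.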

Key takeaway

For machine learning engineers building vision-language models for multi-image causal reasoning, BridgeVLM's results suggest that causal supervision is more effective when internalized in the model than when supplied through prompts. Consider converting induced causal graphs into structured Causal Tokens and integrating RAMP layers into the LLM decoder. According to the reported results, this gives more reliable control at inference and higher accuracy on interventional and counterfactual queries than brittle prompt-level supervision.

Key insights

BridgeVLM internalizes visual causal reasoning in VLMs by converting causal graphs into executable tokens for improved control.

Method

BridgeVLM induces a causal graph from multi-image inputs, converts it to Causal Tokens, and executes these via RAMP layers in the LLM decoder for causal message passing.
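
The pipeline described above (graph, then tokens, then message passing) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the summary does not specify the Causal Token format, the internals of RAMP layers, or how interventions are encoded, so the token strings, the mean-of-parents update, and the `intervene` helper (standard graph surgery that cuts incoming edges) are hypothetical stand-ins, not BridgeVLM's actual design.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (cause, effect)


def graph_to_tokens(edges: List[Edge]) -> List[str]:
    """Serialize directed edges into a flat token sequence.

    The token strings are a hypothetical format, not BridgeVLM's.
    """
    tokens: List[str] = []
    for cause, effect in sorted(edges):
        tokens += [f"<node:{cause}>", "<causes>", f"<node:{effect}>"]
    return tokens


def message_pass(
    states: Dict[str, List[float]], edges: List[Edge]
) -> Dict[str, List[float]]:
    """One round of passing: each node adds the mean of its parents' states.

    A toy stand-in for causal message passing; RAMP internals are not
    described in the summary.
    """
    parents: Dict[str, List[str]] = {node: [] for node in states}
    for cause, effect in edges:
        parents[effect].append(cause)

    updated: Dict[str, List[float]] = {}
    for node, state in states.items():
        if not parents[node]:
            updated[node] = list(state)  # root nodes are left unchanged
            continue
        count = len(parents[node])
        mean = [
            sum(states[p][i] for p in parents[node]) / count
            for i in range(len(state))
        ]
        updated[node] = [s + m for s, m in zip(state, mean)]
    return updated


def intervene(edges: List[Edge], target: str) -> List[Edge]:
    """Model do(target) by removing every edge that points into target."""
    return [(cause, effect) for cause, effect in edges if effect != target]


# Hypothetical two-cause scene: a light and an object jointly cause a shadow.
edges = [("object", "shadow"), ("light", "shadow")]
states = {"light": [1.0, 0.0], "object": [0.0, 1.0], "shadow": [0.0, 0.0]}

tokens = graph_to_tokens(edges)
observed = message_pass(states, edges)
intervened = message_pass(states, intervene(edges, "shadow"))
```

In this toy setting the `shadow` node receives information from both parents under the observed graph, but keeps its own state once an intervention on `shadow` cuts its incoming edges. That difference between observed and intervened graphs is the kind of behavior interventional queries probe.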

Best for: Research Scientist, AI Scientist, Computer Vision Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.