Self-Questioning Vision-Language Models: Reinforcement Learning for Compositional Visual Reasoning

2026-06-14 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision & Pattern Recognition · Depth: Expert, quick

Summary

Vision-Language Models (VLMs) often struggle with compositional visual reasoning, where answering a question requires chaining several steps. A new self-questioning framework trains a VLM to decompose a complex visual question into smaller sub-questions and answer each one before giving a final response. The framework uses Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm, to train a 3-billion-parameter model. The model discovers its own decomposition strategies without relying on expensive human-written step-by-step explanations: learning is driven by a reward signal that scores both the generation of intermediate sub-questions and the correctness of the final answer. The framework was applied to synthetic scenes (CLEVR) and real-world photographs (A-OKVQA). On A-OKVQA, self-questioning reached 52.2% accuracy and standard reinforcement learning reached 51.6%, against 46.8% for the untrained model. That is a gain of 5.4 and 4.8 percentage points respectively, with self-questioning 0.6 points ahead of standard reinforcement learning.

Key takeaway

Machine learning engineers building Vision-Language Models for complex visual reasoning should consider self-questioning frameworks. Rewarding the model for generating intermediate sub-questions raised A-OKVQA accuracy from 46.8% (untrained) to 52.2%, a gain of 5.4 percentage points. Standard reinforcement learning alone reached 51.6%, so most of that gain comes from reinforcement learning itself and the sub-question reward adds a further 0.6 points. Because the decomposition strategy is learned from the reward signal, this approach can reduce reliance on expensive human-annotated step-by-step explanations when training for compositional reasoning.

Key insights

Rewarding VLMs for generating intermediate sub-questions enables them to autonomously discover compositional reasoning strategies for complex visual tasks.

Principles

VLMs struggle with compositional visual reasoning.
Self-questioning improves complex visual reasoning.
Reward signals can guide autonomous decomposition.

Method

A self-questioning framework trains a VLM using Group Relative Policy Optimization (GRPO) to break visual questions into sub-questions. A reward signal scores sub-question generation and final answer correctness.
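
The group-relative part of GRPO can be illustrated with a minimal sketch: several completions are sampled for the same image-question pair, and each completion's reward is normalized against the mean and standard deviation of its own group. The function name, the epsilon term, the use of population standard deviation, and the example reward values are assumptions made for illustration, not details taken from the original article.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampling group (GRPO-style advantages).

    `rewards` holds one scalar reward per completion sampled for the SAME
    prompt. `eps` (an assumed constant) avoids division by zero when every
    completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ here
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions of one visual question.
example_rewards = [1.2, 0.2, 1.0, 0.0]
example_advantages = group_relative_advantages(example_rewards)
print(example_advantages)
```

Completions that score above their group's average receive a positive advantage and are reinforced; those below average are discouraged. No separate learned value model is needed for this baseline, since the group itself provides it.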

In practice

Apply GRPO for VLM compositional reasoning.
Reward intermediate steps, not just final answers.
Use synthetic and real-world datasets for training.
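
As a rough sketch of rewarding intermediate steps and not only the final answer, the function below combines an answer-correctness term with a small bonus for well-formed sub-question/sub-answer pairs. Everything specific here is hypothetical: the `<subq>`, `<suba>`, and `<answer>` tags, the exact-match answer check, the `subq_bonus` weight, and the `max_subqs` cap are illustrative assumptions, and the article's actual reward design may differ.

```python
import re

# Hypothetical output format: the tag names are assumptions for illustration.
SUBQ_PATTERN = re.compile(r"<subq>(.*?)</subq>\s*<suba>(.*?)</suba>", re.DOTALL)
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def self_questioning_reward(completion, gold_answer, subq_bonus=0.2, max_subqs=3):
    """Score one completion: final-answer correctness plus a capped bonus
    for sub-question/sub-answer pairs that look well formed."""
    pairs = [(q.strip(), a.strip()) for q, a in SUBQ_PATTERN.findall(completion)]
    # Count only pairs with a question ending in "?" and a non-empty answer.
    valid = [p for p in pairs if p[0].endswith("?") and p[1]]
    # Capping the count keeps the model from padding output with sub-questions.
    decomposition = subq_bonus * min(len(valid), max_subqs) / max_subqs

    match = ANSWER_PATTERN.search(completion)
    predicted = match.group(1).strip().lower() if match else ""
    correctness = 1.0 if predicted == gold_answer.strip().lower() else 0.0
    return correctness + decomposition


sample = (
    "<subq>What object is left of the cube?</subq><suba>a sphere</suba>"
    "<subq>What color is that sphere?</subq><suba>red</suba>"
    "<answer>Red</answer>"
)
print(self_questioning_reward(sample, "red"))
```

Keeping the decomposition bonus small relative to the correctness term is one way to stop the policy from optimizing for sub-question count at the expense of the final answer. The right balance is an empirical question.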

Topics

Vision-Language Models
Compositional Reasoning
Reinforcement Learning
Group Relative Policy Optimization
Self-Questioning AI
A-OKVQA Dataset
Computer Vision

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.