PaLMR: Towards Faithful Visual Reasoning via Multimodal Process Alignment

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

PaLMR (Process Alignment for Multimodal Reasoning) is a framework for mitigating process hallucinations in Multimodal Large Language Models (MLLMs): cases where a model reaches the correct answer despite flawed visual reasoning. It introduces a perception-aligned data layer (PaDLayer) that generates process-aware reasoning data with structured pseudo-ground-truths and verifiable visual facts. A complementary process-aligned optimization layer (PaOLayer) applies a hierarchical reward fusion scheme, including a process-aware scoring function, to encourage visually faithful chains of thought and improve training stability. In experiments on Qwen2.5-VL-7B, PaLMR is reported to substantially reduce reasoning hallucinations and improve visual reasoning fidelity, achieving state-of-the-art results on HallusionBench while maintaining strong performance on MMMU, MathVista, and MathVerse, with training on approximately 4.7K high-quality samples.

Key takeaway

AI scientists and ML engineers building multimodal LLMs should note that rewarding only final-answer correctness during reinforcement learning risks reinforcing visual hallucinations. Consider integrating process-level alignment techniques such as PaLMR's hierarchical reward fusion, which prioritizes visual fidelity at each reasoning step. The approach is reported to reduce hallucinations and improve reasoning stability, offering a path to more reliable and interpretable MLLMs, even with small, high-quality datasets.

Key insights

Aligning multimodal reasoning processes with visual evidence, not just final answers, is crucial to prevent hallucinations.

Principles

Reward mechanisms must supervise the faithfulness of the reasoning process, not just the correctness of the outcome.
Hierarchical reward fusion that prioritizes visual fidelity enhances training stability.
Pairwise comparison for visual consistency scoring is more robust than point-wise methods.
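
The pairwise principle can be illustrated with a minimal sketch. It assumes a judge that compares two reasoning traces and returns 1.0 if the first is more visually faithful, 0.0 if the second is, and 0.5 for a tie; each trace is then scored by its win rate against the others. The names pairwise_fidelity_scores and make_toy_judge are hypothetical, and the keyword-matching judge is only a stand-in for an LLM-as-judge call. This is not PaLMR's actual scoring function, which the summary does not specify.

```python
from itertools import combinations
from typing import Callable, List, Sequence

# A judge takes two responses and returns 1.0 (first wins),
# 0.0 (second wins), or 0.5 (tie). This signature is an assumption.
Judge = Callable[[str, str], float]


def pairwise_fidelity_scores(responses: Sequence[str], judge: Judge) -> List[float]:
    """Score each response by its win rate over all pairwise comparisons."""
    n = len(responses)
    if n < 2:
        # With nothing to compare against, fall back to a neutral score.
        return [0.5] * n
    wins = [0.0] * n
    for i, j in combinations(range(n), 2):
        outcome = judge(responses[i], responses[j])
        wins[i] += outcome
        wins[j] += 1.0 - outcome
    # Each response takes part in exactly n - 1 comparisons.
    return [w / (n - 1) for w in wins]


def make_toy_judge(visual_facts: Sequence[str]) -> Judge:
    """Hypothetical stand-in for an LLM judge: counts mentioned visual facts."""
    def judge(a: str, b: str) -> float:
        hits_a = sum(fact in a for fact in visual_facts)
        hits_b = sum(fact in b for fact in visual_facts)
        if hits_a == hits_b:
            return 0.5
        return 1.0 if hits_a > hits_b else 0.0
    return judge
```

A relative judgment of this kind only asks the judge which of two traces is better grounded, which avoids calibrating an absolute score scale across prompts.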

Method

PaLMR constructs perception-aligned data via learnability-based filtering and Gemini-generated pseudo-ground-truths, then optimizes with Vision-Guided GRPO (V-GRPO) using a hierarchical reward function that gates final answer rewards on visual fidelity.
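
For readers unfamiliar with GRPO, the sketch below shows the standard group-relative advantage step that GRPO-style methods build on: several responses are sampled for the same prompt, and each reward is normalized against the group mean and standard deviation. How V-GRPO modifies this step is not described in this summary, so the code covers only the generic baseline. The function name, the use of the population standard deviation, and the eps value are assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of rollouts for the same prompt.

    Assumption: population standard deviation; some implementations
    use the sample standard deviation instead.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every rollout in a group receives the same reward, all advantages are zero and the group contributes no learning signal, which is one reason the spread of rewards across rollouts matters for training stability.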

In practice

Generate structured, question-agnostic visual ground truths using powerful LLMs like Gemini.
Implement pairwise visual fidelity scoring with an LLM-as-judge (e.g., Qwen3-30B-A3B).
Design reward functions to strictly penalize perceptual errors: when the reasoning misstates a visual fact, set the total reward to zero.
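
The gating idea in the last item can be sketched as follows. The sketch assumes a visual fidelity score in [0, 1] (for example, from a pairwise judge) and a boolean answer check. If fidelity falls below a threshold, the total reward is zero regardless of the answer; otherwise fidelity and answer correctness are combined. The threshold, the weights, and all names are hypothetical illustration values, not figures from the paper.

```python
from dataclasses import dataclass


@dataclass
class RolloutScores:
    visual_fidelity: float  # assumed to lie in [0, 1]
    answer_correct: bool


def hierarchical_reward(
    scores: RolloutScores,
    fidelity_threshold: float = 0.5,  # hypothetical cut-off for a perceptual error
    w_fidelity: float = 0.5,          # hypothetical weight
    w_answer: float = 0.5,            # hypothetical weight
) -> float:
    """Gate the answer reward on visual fidelity."""
    if scores.visual_fidelity < fidelity_threshold:
        # Perceptual error: a correct final answer earns nothing.
        return 0.0
    answer_term = 1.0 if scores.answer_correct else 0.0
    return w_fidelity * scores.visual_fidelity + w_answer * answer_term
```

Under this gate, a rollout that reaches the right answer through a misread image scores no better than one that is wrong outright, so the policy cannot be rewarded for a lucky guess.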

Topics

Multimodal LLMs
Visual Reasoning
Reinforcement Learning
Process Alignment
LLM Hallucinations
Reward Modeling

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.