PaLMR: Towards Faithful Visual Reasoning via Multimodal Process Alignment
Summary
PaLMR (Process Alignment for Multimodal Reasoning) is a framework designed to mitigate process hallucinations in Multimodal Large Language Models (MLLMs), in which a model reaches the correct answer through flawed visual reasoning. It has two components. A perception-aligned data layer (PaDLayer) generates process-aware reasoning data with structured pseudo-ground-truths and verifiable visual facts. A process-aligned optimization layer (PaOLayer) applies a hierarchical reward fusion scheme, including a process-aware scoring function, to foster visually faithful chains-of-thought and improve training stability. Experiments on Qwen2.5-VL-7B show that PaLMR significantly reduces reasoning hallucinations and improves visual reasoning fidelity. It achieves state-of-the-art results on HallusionBench while maintaining strong performance on MMMU, MathVista, and MathVerse, training on approximately 4.7K high-quality samples.
Key takeaway
For AI Scientists and ML Engineers developing multimodal LLMs: rewarding only final-answer correctness in reinforcement learning risks propagating visual hallucinations. You should integrate process-level alignment techniques like PaLMR's hierarchical reward fusion, which prioritizes visual fidelity at each reasoning step. The paper reports that this approach reduces hallucinations and improves reasoning stability, offering a path to more reliable and interpretable MLLMs, even with smaller, high-quality datasets.
Key insights
Aligning multimodal reasoning processes with visual evidence, not just final answers, is crucial to prevent hallucinations.
Principles
- Reward mechanisms must supervise reasoning process faithfulness, not just outcome correctness.
- Hierarchical reward fusion, prioritizing visual fidelity, enhances training stability.
- Pairwise comparison for visual consistency scoring is more robust than point-wise methods.
Method
PaLMR constructs perception-aligned data via learnability-based filtering and Gemini-generated pseudo-ground-truths. It then optimizes with Vision-Guided GRPO (V-GRPO), using a hierarchical reward function that gates the final-answer reward on visual fidelity: a response earns credit for its answer only if its reasoning is visually faithful.
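The gating idea can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names, the `fidelity_threshold` and `answer_weight` parameters, and the additive combination are assumptions made for illustration. The advantage computation is the standard group-relative normalization used in GRPO, not anything specific to V-GRPO.

```python
from statistics import mean, pstdev
from typing import List


def gated_reward(visual_fidelity: float, answer_correct: bool,
                 fidelity_threshold: float = 0.5,
                 answer_weight: float = 1.0) -> float:
    """Hypothetical hierarchical reward for one sampled response.

    visual_fidelity is assumed to lie in [0, 1]. The threshold and weight
    are illustrative values, not taken from the paper.
    """
    # Gate: a response that fails the visual check earns nothing,
    # even if its final answer happens to be correct.
    if visual_fidelity < fidelity_threshold:
        return 0.0
    # Past the gate, fidelity and answer correctness both contribute.
    return visual_fidelity + (answer_weight if answer_correct else 0.0)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standard GRPO-style normalization within one group of samples."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of three sampled responses to the same question.
rewards = [
    gated_reward(0.2, True),   # right answer, unfaithful reasoning -> 0.0
    gated_reward(0.8, True),   # faithful and correct
    gated_reward(0.8, False),  # faithful but wrong
]
advantages = group_relative_advantages(rewards)
```

Under these assumptions, the first response receives the lowest advantage in its group despite answering correctly, which is the behaviour that outcome-only rewards cannot produce.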
In practice
- Generate structured, question-agnostic visual ground truths using powerful LLMs like Gemini.
- Implement pairwise visual fidelity scoring with an LLM-as-judge (e.g., Qwen3-30B-A3B).
- Design reward functions to penalize perceptual errors strictly: when a perceptual error is detected, set the total reward to zero.
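The pairwise scoring step in the list above could be sketched as follows. Everything here is an assumption for illustration: the `Judge` signature, the round-robin win-rate aggregation, and especially `keyword_judge`, a toy stand-in that only counts fact mentions. In practice the judge would be a call to an LLM such as Qwen3-30B-A3B, and the paper's actual prompt and aggregation scheme are not reproduced here.

```python
from itertools import combinations
from typing import Callable, List, Sequence

# Hypothetical judge interface: returns 0 if the first response is more
# visually faithful, 1 if the second is, and -1 for a tie.
Judge = Callable[[str, str, Sequence[str]], int]


def pairwise_fidelity_scores(responses: Sequence[str],
                             facts: Sequence[str],
                             judge: Judge) -> List[float]:
    """Score each response by its win rate over all pairwise comparisons."""
    n = len(responses)
    if n < 2:
        return [0.0] * n  # nothing to compare against
    wins = [0.0] * n
    for i, j in combinations(range(n), 2):
        verdict = judge(responses[i], responses[j], facts)
        if verdict == 0:
            wins[i] += 1.0
        elif verdict == 1:
            wins[j] += 1.0
        else:  # tie: split the point
            wins[i] += 0.5
            wins[j] += 0.5
    # Each response takes part in n - 1 comparisons.
    return [w / (n - 1) for w in wins]


def keyword_judge(a: str, b: str, facts: Sequence[str]) -> int:
    """Toy judge: prefers the response that mentions more visual facts."""
    def hits(text: str) -> int:
        return sum(fact.lower() in text.lower() for fact in facts)
    ha, hb = hits(a), hits(b)
    if ha == hb:
        return -1
    return 0 if ha > hb else 1


facts = ["red bar", "40"]  # visual ground-truth facts for one image
responses = [
    "The red bar is tallest at 40.",
    "The blue bar is tallest.",
    "There are no bars.",
]
scores = pairwise_fidelity_scores(responses, facts, keyword_judge)
```

A relative comparison asks the judge only which of two responses is more faithful, so it does not depend on the judge producing calibrated absolute scores. The resulting win rates could then feed a gate that zeroes the reward for responses with perceptual errors.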
Topics
- Multimodal LLMs
- Visual Reasoning
- Reinforcement Learning
- Process Alignment
- LLM Hallucinations
- Reward Modeling
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.