PaLMR: Towards Faithful Visual Reasoning via Multimodal Process Alignment

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

PaLMR (Process Alignment for Multimodal Reasoning) is a framework for mitigating process hallucinations in Multimodal Large Language Models (MLLMs): cases in which a model reaches a correct answer through flawed visual reasoning. It has two components. A perception-aligned data layer (PaDLayer) generates process-aware reasoning data with structured pseudo-ground-truths and verifiable visual facts. A process-aligned optimization layer (PaOLayer) applies a hierarchical reward fusion scheme, including a process-aware scoring function, to encourage visually faithful chains of thought and improve training stability. In experiments on Qwen2.5-VL-7B, PaLMR significantly reduces reasoning hallucinations and improves visual reasoning fidelity. It achieves state-of-the-art results on HallusionBench and maintains strong performance on MMMU, MathVista, and MathVerse, using approximately 4.7K high-quality samples.

Key takeaway

For AI scientists and ML engineers developing multimodal LLMs: rewarding only final-answer correctness during reinforcement learning risks reinforcing visual hallucinations, because a rollout can be rewarded for a right answer reached through a wrong reading of the image. Consider process-level alignment techniques such as PaLMR's hierarchical reward fusion, which prioritizes visual fidelity at each reasoning step. The approach is reported to reduce hallucinations and improve reasoning stability, and it offers a path to more reliable and interpretable MLLMs even with small, high-quality datasets.

Key insights

Preventing hallucinations requires aligning the multimodal reasoning process itself with visual evidence, not just the final answer.

Method

PaLMR works in two stages. First, it constructs perception-aligned data using learnability-based filtering and Gemini-generated pseudo-ground-truths. Second, it optimizes the model with Vision-Guided GRPO (V-GRPO), using a hierarchical reward function that gates the final-answer reward on visual fidelity.
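
The gating idea can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `Rollout` container, the threshold, the weights, and the exact reward formula are all hypothetical choices made for illustration. The only parts taken from the method description are that the final-answer reward is granted conditionally on visual fidelity, and that GRPO-style training compares rewards within a group of rollouts sampled for the same prompt.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    """One sampled chain of thought for a prompt (hypothetical container)."""
    format_ok: bool          # response follows the expected output format
    visual_fidelity: float   # process score in [0, 1], assumed to come from a judge
    answer_correct: bool     # final answer matches the reference


def gated_reward(r, fidelity_threshold=0.5, w_process=0.5, w_answer=1.0):
    """Hierarchical reward: format first, then process, then answer.

    The threshold and weights are illustrative assumptions, not values
    reported for PaLMR.
    """
    if not r.format_ok:
        return 0.0  # a malformed response earns nothing
    reward = w_process * r.visual_fidelity
    # Gate: the answer bonus is paid only when the reasoning is visually faithful.
    if r.answer_correct and r.visual_fidelity >= fidelity_threshold:
        reward += w_answer
    return reward


def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a sample std would also work
    return [(x - mu) / (sigma + eps) for x in rewards]


if __name__ == "__main__":
    group = [
        Rollout(True, 0.9, True),    # faithful reasoning, correct answer
        Rollout(True, 0.2, True),    # correct answer, unfaithful reasoning
        Rollout(True, 0.8, False),   # faithful reasoning, wrong answer
        Rollout(False, 1.0, True),   # malformed output
    ]
    rewards = [gated_reward(r) for r in group]
    for r, a in zip(rewards, group_advantages(rewards)):
        print(f"reward={r:.2f} advantage={a:+.2f}")
```

Under these assumed weights, a rollout that reaches the right answer through unfaithful reasoning scores below one that reasons faithfully but answers incorrectly, which is the behavior an answer-only reward cannot produce.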

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.