From Verdict to Process: Agentic Reinforcement Learning for Multi-Stage Fact Verification
Summary
ProFact is an agentic reinforcement learning framework for end-to-end optimization of multi-stage fact verification. Existing methods optimize each stage in isolation; ProFact instead trains a single policy to coordinate claim decomposition, evidence seeking, answer generation, and verdict prediction. Because final veracity labels provide only sparse, delayed supervision, ProFact adds process-aware rewards that supply dense, stage-level learning signals. On AVeriTeC, ProFact consistently outperforms strong baselines in both verification performance and inference efficiency across four open-source backbone models (Qwen2.5-3B, Qwen2.5-7B, Qwen3-4B, Qwen3-8B). It also improves intermediate evidence quality and the overall AVeriTeC Score while reducing token costs.
Key takeaway
For AI scientists and machine learning engineers building fact verification systems, ProFact shows that unifying a multi-stage workflow under agentic reinforcement learning with process-aware rewards improves both accuracy and efficiency on AVeriTeC. Consider adopting end-to-end policy optimization and dense intermediate feedback to counter sparse supervision in complex, multi-step reasoning tasks; doing so may also reduce token costs.
Key insights
End-to-end agentic reinforcement learning with process-aware rewards optimizes multi-stage fact verification for better performance and efficiency.
Principles
- Multi-stage verification benefits from unified policy optimization.
- Dense, stage-level rewards improve credit assignment in long-horizon tasks.
- Larger LLMs do not always yield more reliable evidence-grounded verification.
Method
ProFact formulates verification as a three-stage Markov Decision Process (Question, Search, Verdict) and optimizes it end-to-end using Group Relative Policy Optimization (GRPO) with process-aware, METEOR-based rewards.
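The group-relative step at the core of GRPO can be illustrated with a minimal sketch. It assumes the commonly described formulation, in which several rollouts are sampled for the same claim and each rollout's reward is normalized by the group's mean and standard deviation. The function name, the example rewards, and the choice of population standard deviation are illustrative assumptions, not details taken from ProFact.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group.

    Assumption: population standard deviation is used; implementations
    differ on this detail. `eps` guards against a zero-variance group.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for the same claim, each scored by its total reward.
advantages = group_relative_advantages([0.2, 0.8, 0.5, 0.5])

# A group in which every rollout earns the same reward carries no signal.
flat = group_relative_advantages([0.7, 0.7, 0.7])
```

Rollouts that beat their group average receive a positive advantage and those below it a negative one, so no separate value network is needed to form a baseline.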
In practice
- Implement process-aware rewards for complex, multi-stage workflows.
- Use GRPO for stable multi-stage RL training in agentic LLMs.
- Evaluate LLM size impact on evidence-grounded reasoning tasks.
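As a rough illustration of the first item above, the sketch below combines stage-level scores with the final verdict signal into one scalar reward. Everything in it is an assumption made for illustration: the weights, the dictionary keys, and the unigram-overlap F1 used as a simple stand-in for METEOR are hypothetical and do not reproduce ProFact's actual reward.

```python
from collections import Counter


def token_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1: a simple stand-in for METEOR, not METEOR itself."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0  # also covers an empty candidate or reference
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def process_aware_reward(rollout, reference, weights=(0.3, 0.3, 0.4)):
    """Weighted sum of question, evidence, and verdict scores.

    The weights are hypothetical; they only show how intermediate stages
    can contribute signal alongside the final label.
    """
    w_question, w_evidence, w_verdict = weights
    question_score = token_f1(rollout["question"], reference["question"])
    evidence_score = token_f1(rollout["evidence"], reference["evidence"])
    verdict_score = 1.0 if rollout["verdict"] == reference["verdict"] else 0.0
    return (w_question * question_score
            + w_evidence * evidence_score
            + w_verdict * verdict_score)


reference = {
    "question": "when was the bridge opened",
    "evidence": "the bridge opened in 1932",
    "verdict": "Supported",
}
# Correct verdict reached with no usable intermediate output.
lucky_guess = {"question": "", "evidence": "", "verdict": "Supported"}
# Good intermediate output that ends in the wrong verdict.
wrong_verdict = dict(reference, verdict="Refuted")

lucky_reward = process_aware_reward(lucky_guess, reference)
wrong_reward = process_aware_reward(wrong_verdict, reference)
```

With a verdict-only reward, the lucky guess would earn full credit and the well-grounded but wrong rollout would earn none; a stage-level reward separates the two cases, which is the credit-assignment benefit the summary describes.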
Topics
- Agentic Reinforcement Learning
- Fact Verification
- Large Language Models
- Multi-stage Workflows
- Process-Aware Rewards
- GRPO
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.