Verifiable Rewards Beyond Math and Code: Lightweight Corpus-Grounded Process Supervision for Factual Question Answering
Summary
CorVer (Corpus Verify) is a novel, lightweight process reward designed to improve factual accuracy in knowledge-intensive question answering using reinforcement learning. It addresses the limitations of expensive and unreliable neural verifiers by employing a corpus-grounded signal derived from Wikipedia co-occurrence statistics for sentence-level credit assignment. This plug-in-ready system maps sentence-level feedback to token-level advantages with a simple alignment, requiring only a 0.5B extractor and a single corpus lookup per sentence. Evaluated across 30 (model, benchmark) cells, encompassing six instruction-tuned models (3B to 14B) and five QA benchmarks, CorVer consistently improved over raw baselines in every cell, achieving an average TriviaQA gain of +4.1 percentage points. Furthermore, it outperformed four neural-verifier baselines in 18 of 20 feasible configurations, while training 4.8 to 8.4 times faster.
Key takeaway
For machine learning engineers building factual question answering systems with reinforcement learning, CorVer offers an alternative to costly neural verifiers. Consider integrating this lightweight, corpus-grounded process supervision: in the reported experiments it improved over the raw baseline in all 30 (model, benchmark) cells, with an average TriviaQA gain of +4.1 percentage points, and trained 4.8 to 8.4 times faster than the neural-verifier baselines. This approach allows you to deploy more reliable and efficient reward signals, especially for rare-entity facts, without the overhead of complex verification pipelines.
Key insights
CorVer uses Wikipedia co-occurrence for lightweight, corpus-grounded process supervision to boost factual QA accuracy.
Principles
- Corpus-grounded signals can replace expensive neural verifiers.
- Fine-grained, sentence-level rewards improve RL for factual QA.
- Wikipedia co-occurrence statistics offer reliable factual verification.
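To make the idea of a co-occurrence signal concrete, here is a minimal sketch. It is not CorVer's actual scoring rule, which this summary does not specify. It assumes a hypothetical toy corpus (`PASSAGES`) in which each passage is reduced to a set of entity strings, and a hypothetical scoring function (`cooccurrence_support`) that returns the fraction of entity pairs in a sentence that appear together in at least one passage.

```python
from itertools import combinations

# Hypothetical toy stand-in for a Wikipedia index: each passage is the set of
# entities it mentions. The real corpus and index format are not described here.
PASSAGES = [
    {"Marie Curie", "Nobel Prize", "Warsaw"},
    {"Marie Curie", "Pierre Curie", "Nobel Prize"},
    {"Albert Einstein", "Nobel Prize", "Ulm"},
]


def cooccurrence_support(entities, passages=PASSAGES):
    """Fraction of entity pairs that co-occur in at least one passage.

    Assumed scoring rule for illustration only; entities would come from an
    extractor model in a real pipeline.
    """
    pairs = list(combinations(sorted(set(entities)), 2))
    if not pairs:
        return 0.0  # fewer than two entities: nothing to check
    supported = sum(
        any(a in passage and b in passage for passage in passages)
        for a, b in pairs
    )
    return supported / len(pairs)


# One sentence -> one lookup over the corpus -> one scalar reward.
print(cooccurrence_support(["Marie Curie", "Warsaw"]))
print(cooccurrence_support(["Albert Einstein", "Warsaw"]))
```

A sentence whose entities are attested together scores high, and one whose entities never share a passage scores low. A real system would need entity normalization and an efficient index, both omitted here.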
Method
CorVer assigns sentence-level credit using Wikipedia co-occurrence, then aligns this to token-level advantages. It requires a 0.5B extractor and one corpus lookup per sentence.
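The sentence-to-token mapping described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the function name, the per-token sentence index input, and the choice to subtract the mean sentence reward as a baseline are all hypothetical, since the summary says only that the alignment is simple.

```python
def sentence_rewards_to_token_advantages(token_sentence_ids, sentence_rewards):
    """Broadcast each sentence's reward to the tokens of that sentence.

    token_sentence_ids: for each generated token, the index of its sentence.
    sentence_rewards:   one scalar reward per sentence.
    Assumption: the advantage is the sentence reward minus the mean reward
    over the response's sentences (a simple baseline chosen for illustration).
    """
    if not sentence_rewards:
        return []
    baseline = sum(sentence_rewards) / len(sentence_rewards)
    centered = [reward - baseline for reward in sentence_rewards]
    return [centered[sentence_id] for sentence_id in token_sentence_ids]


# Two sentences: the first (2 tokens) is supported, the second (3 tokens) is not.
advantages = sentence_rewards_to_token_advantages([0, 0, 1, 1, 1], [1.0, 0.0])
print(advantages)  # [0.5, 0.5, -0.5, -0.5, -0.5]
```

Every token inherits the credit of its sentence, so tokens in an unsupported sentence are pushed down while tokens in a supported sentence are pushed up, which is the fine-grained credit assignment that a single response-level reward cannot provide.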
In practice
- Improve factual accuracy in knowledge-intensive QA.
- Accelerate RL training for factual verification.
- Deploy lightweight process supervision.
Topics
- Reinforcement Learning
- Factual Question Answering
- Reward Design
- Corpus-Grounded Supervision
- Wikipedia Co-occurrence
- Large Language Models
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.