Deliberative Searcher: Improving LLM Reliability via Reinforcement Learning with constraints
Summary
Deliberative Searcher is a framework for improving the reliability of large language models (LLMs) in open-domain question answering by integrating certainty calibration with retrieval-based search. It follows a "reasoning-primary, information-secondary" paradigm: the LLM assesses its own confidence, issues search and read actions against external sources such as Wikipedia when needed, and updates its confidence iteratively before giving a final, confidence-annotated answer. The framework is trained with a constrained reinforcement learning algorithm, an extension of Group Relative Policy Optimization (GRPO), that optimizes accuracy under a soft reliability constraint. On benchmarks including HotpotQA and GAIA, Deliberative Searcher-7B and Deliberative Searcher-72B are reported to achieve higher average reliability (0.75) and accuracy (0.35 for 7B, 0.48 for 72B), with lower false-certain rates, than the baselines.
Key takeaway
For research scientists developing reliable LLM applications, consider a "reasoning-primary, information-secondary" architecture like Deliberative Searcher. By combining confidence calibration with constrained reinforcement learning, this approach can improve the alignment between model confidence and correctness, which makes outputs more trustworthy and reduces false-certain responses in open-domain QA systems.
Key insights
Integrating confidence calibration with retrieval-based search via constrained reinforcement learning improves LLM reliability and accuracy.
Principles
- Align model confidence with factual correctness.
- Prioritize reasoning, then external information.
- Provide transparent evidence trails.
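The first principle can be made measurable. The sketch below counts "false-certain" responses, meaning answers given with high stated confidence that turn out to be wrong. The function name, the record format, and the threshold of 8 on the 1-10 scale are illustrative assumptions, not definitions taken from the original article.

```python
def false_certain_rate(records, certain_threshold=8):
    """Fraction of responses that are confident but wrong.

    records: list of (confidence, correct) pairs, with confidence on a 1-10 scale.
    certain_threshold: assumed cutoff at or above which a response counts as "certain".
    """
    if not records:
        raise ValueError("records must not be empty")
    false_certain = sum(
        1
        for confidence, correct in records
        if confidence >= certain_threshold and not correct
    )
    return false_certain / len(records)


# Four hypothetical responses: only the second is both certain and wrong.
sample = [(9, True), (9, False), (3, False), (10, True)]
rate = false_certain_rate(sample)
```

A low-confidence wrong answer (the third record) does not count against the model here, which is the point of the principle: being wrong is less harmful when the model signals its doubt.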
Method
Deliberative Searcher is trained with a constrained reinforcement learning algorithm that extends GRPO. It optimizes accuracy while keeping reliability at a target threshold, dynamically adjusting a Lagrangian term to enforce the constraint.
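One common way to adjust a Lagrangian term dynamically is dual ascent on a non-negative multiplier. The sketch below shows that generic technique, not the paper's exact update rule. The function names, the step size, and the way the two reward signals are combined are all assumptions made for illustration.

```python
def update_multiplier(lam, reliability, target, step_size=0.05):
    """Dual-ascent step on the multiplier (assumed update rule).

    Raises lam when measured reliability falls short of the target,
    lowers it when the constraint is satisfied, and never goes below zero.
    """
    return max(0.0, lam + step_size * (target - reliability))


def penalized_reward(accuracy_reward, reliability_reward, lam):
    """Combine the two signals into the scalar reward used by the policy update."""
    return accuracy_reward + lam * reliability_reward


# Reliability (0.60) is below the hypothetical target (0.75), so lam grows.
lam = update_multiplier(lam=0.5, reliability=0.60, target=0.75)
reward = penalized_reward(accuracy_reward=1.0, reliability_reward=0.5, lam=2.0)
```

With this scheme the reliability term dominates training only while the constraint is violated. Once reliability exceeds the target, the multiplier decays toward zero and the policy is optimized mainly for accuracy.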
In practice
- Use multi-step reflection and verification.
- Employ THINK, SEARCH, and READ actions.
- Calibrate confidence scores (on a 1-10 scale) against correctness.
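The practices above can be sketched as a simple control loop. This is a minimal illustration under stated assumptions: `Step`, `run_episode`, the `ANSWER` action, the `policy`, `search`, and `read` callables, and the step budget are all hypothetical names introduced here, and the fallback to confidence 1 when the budget runs out is a design choice of this sketch, not something the original article specifies.

```python
from dataclasses import dataclass


@dataclass
class Step:
    action: str      # "THINK", "SEARCH", "READ", or "ANSWER" (ANSWER is assumed)
    content: str     # reasoning text, search query, document id, or final answer
    confidence: int  # self-reported confidence, 1 (guess) to 10 (certain)


def run_episode(policy, search, read, question, max_steps=8):
    """Run one question through the loop and return (answer, confidence, trail)."""
    trail = []              # evidence trail: every step taken, in order
    observation = question  # what the policy sees from the previous action
    for _ in range(max_steps):
        step = policy(question, trail, observation)
        if not 1 <= step.confidence <= 10:
            raise ValueError("confidence must be between 1 and 10")
        trail.append(step)
        if step.action == "ANSWER":
            return step.content, step.confidence, trail
        if step.action == "SEARCH":
            observation = search(step.content)
        elif step.action == "READ":
            observation = read(step.content)
        else:  # THINK produces no external observation
            observation = ""
    # Step budget exhausted: report no answer with the lowest confidence.
    return None, 1, trail


# Usage with a scripted stand-in for the model and stubbed tools.
script = iter([
    Step("THINK", "I am unsure; I should look this up.", 3),
    Step("SEARCH", "capital of France", 3),
    Step("READ", "doc-1", 6),
    Step("ANSWER", "Paris", 9),
])
answer, confidence, trail = run_episode(
    policy=lambda question, trail, observation: next(script),
    search=lambda query: "doc-1: France",
    read=lambda doc_id: "Paris is the capital of France.",
    question="What is the capital of France?",
)
```

Returning the full trail alongside the answer keeps the evidence inspectable, and recording a confidence at every step lets a reader see how it changed as information arrived.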
Topics
- Deliberative Searcher
- LLM Reliability
- Constrained Reinforcement Learning
- Certainty Calibration
- Retrieval-Augmented Generation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.