Deliberative Searcher: Improving LLM Reliability via Reinforcement Learning with constraints

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, long

Summary

Deliberative Searcher is a framework for improving the reliability of large language models (LLMs) in open-domain question answering by integrating certainty calibration with retrieval-based search. It follows a "reasoning-primary, information-secondary" paradigm: the LLM self-assesses its confidence, triggers search and read actions against external sources such as Wikipedia when needed, and updates its confidence iteratively before giving a final, confidence-annotated answer. The framework is trained with a constrained reinforcement learning algorithm, an extension of Group Relative Policy Optimization (GRPO), that optimizes for accuracy under a soft reliability constraint. Across benchmarks such as HotpotQA and GAIA, Deliberative Searcher-7B and Deliberative Searcher-72B achieve higher average reliability (0.75) and accuracy (0.35 for 7B, 0.48 for 72B), with lower false-certain rates, than the baselines they are compared against.
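The assess, search, and re-assess loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `Assessment`, `deliberate`, `toy_assess`, and `toy_search` are hypothetical, and the fixed confidence `threshold` is an assumption made for the sketch (in the framework as summarized, the model itself decides when to search).

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Assessment:
    answer: str            # current best answer
    confidence: float      # self-reported confidence in [0, 1]
    query: Optional[str]   # next search query, or None to stop searching


def deliberate(
    question: str,
    assess: Callable[[str, List[str]], Assessment],  # hypothetical LLM call
    search: Callable[[str], List[str]],              # hypothetical retriever
    threshold: float = 0.8,                          # assumed stopping rule
    max_steps: int = 4,
) -> Tuple[str, float, List[str]]:
    """Reason first; retrieve only while confidence stays below the threshold."""
    evidence: List[str] = []
    state = assess(question, evidence)
    for _ in range(max_steps):
        if state.confidence >= threshold or state.query is None:
            break
        evidence.extend(search(state.query))   # search-and-read action
        state = assess(question, evidence)     # update answer and confidence
    return state.answer, state.confidence, evidence


# Toy stand-ins so the sketch runs without a model or a search backend.
def toy_assess(question: str, evidence: List[str]) -> Assessment:
    if evidence:
        return Assessment("Paris", 0.9, None)
    return Assessment("unknown", 0.3, "capital of France")


def toy_search(query: str) -> List[str]:
    return ["Paris is the capital of France."]


answer, confidence, evidence = deliberate(
    "What is the capital of France?", toy_assess, toy_search
)
```

With the toy stand-ins, the first assessment is low-confidence, one search is issued, and the second assessment returns a confidence-annotated answer without further retrieval.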

Key takeaway

Research scientists developing reliable LLM applications should consider a "reasoning-primary, information-secondary" architecture like Deliberative Searcher. By combining confidence calibration with constrained reinforcement learning, this approach can improve the alignment between model confidence and correctness, yielding more trustworthy outputs and fewer false-certain responses in open-domain QA systems.

Key insights

Integrating confidence calibration with retrieval-based search via constrained reinforcement learning improves LLM reliability and accuracy.

Principles

Method

Deliberative Searcher is trained with a constrained reinforcement learning algorithm that extends GRPO: it optimizes for accuracy while holding reliability at a target threshold, dynamically adjusting a Lagrangian term to enforce the constraint.
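One common way to realize a dynamically adjusted Lagrangian term is dual ascent on a multiplier, combined with GRPO-style group-normalized advantages. The sketch below illustrates that general technique under stated assumptions; the summary does not give the paper's exact reward or update rule, so the per-sample `reliable` flag, the additive reward shaping, the learning rate `lr`, and all function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def shaped_rewards(correct: List[bool], reliable: List[bool], lam: float) -> List[float]:
    """Accuracy reward plus a multiplier-weighted reliability bonus (assumed form)."""
    return [float(c) + lam * float(r) for c, r in zip(correct, reliable)]


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of sampled responses, as in GRPO."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def update_multiplier(lam: float, reliability: float, target: float, lr: float = 0.1) -> float:
    """Dual ascent: raise lam when reliability is below target, lower it otherwise."""
    return max(0.0, lam + lr * (target - reliability))


# One group of four sampled responses to the same question (made-up flags).
correct = [True, False, True, False]
reliable = [True, True, False, False]
lam = 0.5

rewards = shaped_rewards(correct, reliable, lam)
advantages = group_relative_advantages(rewards)

# Observed reliability (0.5) is below the assumed target (0.75), so lam grows.
new_lam = update_multiplier(lam, reliability=mean(map(float, reliable)), target=0.75)
```

The multiplier is clamped at zero, so once the reliability target is met the constraint stops adding pressure and the objective reduces to accuracy alone.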

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.