Deliberative Searcher: Improving LLM Reliability via Reinforcement Learning with constraints

2025-04-14 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, long

Summary

Deliberative Searcher is a framework for improving the reliability of large language models (LLMs) in open-domain question answering by combining certainty calibration with retrieval-based search. It follows a "reasoning-primary, information-secondary" paradigm: the LLM assesses its own confidence, triggers search and read actions against external sources such as Wikipedia when needed, and updates its confidence iteratively before giving a final, confidence-annotated answer. The framework is trained with a constrained reinforcement learning algorithm, an extension of Group Relative Policy Optimization (GRPO), which optimizes for accuracy under a soft reliability constraint. On benchmarks including HotpotQA and GAIA, Deliberative Searcher-7B and Deliberative Searcher-72B reach higher average reliability (0.75) and accuracy (0.35 for 7B, 0.48 for 72B), with lower false-certain rates, than the baselines they are compared against.

Key takeaway

For research scientists building reliable LLM applications, consider a "reasoning-primary, information-secondary" architecture like Deliberative Searcher. By combining confidence calibration with constrained reinforcement learning, this approach improves the alignment between model confidence and correctness, which yields more trustworthy outputs and fewer false-certain responses in open-domain QA systems.

Key insights

Integrating confidence calibration with retrieval-based search via constrained reinforcement learning improves LLM reliability and accuracy.

Principles

Align model confidence with factual correctness.
Prioritize reasoning, then external information.
Provide transparent evidence trails.
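
To make the first principle concrete, here is a minimal sketch of how alignment between confidence and correctness could be measured. The summary does not give the paper's exact metric definitions, so everything here is an assumption for illustration: the `Prediction` type, the threshold of 7 for "certain", and the two functions `false_certain_rate` and `agreement_rate`.

```python
from dataclasses import dataclass

# Hypothetical cutoff: confidence at or above this counts as "certain".
CERTAIN_THRESHOLD = 7


@dataclass
class Prediction:
    correct: bool      # did the answer match the reference?
    confidence: int    # self-reported score on a 1-10 scale


def false_certain_rate(preds, threshold=CERTAIN_THRESHOLD):
    """Fraction of all answers that are wrong yet reported as certain."""
    if not preds:
        return 0.0
    bad = sum(1 for p in preds if not p.correct and p.confidence >= threshold)
    return bad / len(preds)


def agreement_rate(preds, threshold=CERTAIN_THRESHOLD):
    """Fraction of answers where certainty matches correctness
    (certain and right, or uncertain and wrong)."""
    if not preds:
        return 0.0
    agree = sum(1 for p in preds if p.correct == (p.confidence >= threshold))
    return agree / len(preds)


preds = [
    Prediction(correct=True, confidence=9),   # certain and right
    Prediction(correct=False, confidence=8),  # certain and wrong
    Prediction(correct=False, confidence=3),  # uncertain and wrong
    Prediction(correct=True, confidence=4),   # uncertain and right
]
print(false_certain_rate(preds))  # 0.25
print(agreement_rate(preds))      # 0.5
```

The agreement rate here is only a stand-in for the reliability score reported above; the paper may define reliability differently.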

Method

The Deliberative Searcher uses a constrained reinforcement learning algorithm, extending GRPO, to optimize for accuracy while maintaining a target reliability threshold by dynamically adjusting a Lagrangian term.
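
The idea of a dynamically adjusted Lagrangian term can be sketched roughly as follows. This is not the paper's implementation: the reward shape, the step size, and the function names `constrained_reward` and `update_multiplier` are assumptions. Only the group-relative normalization of rewards is standard GRPO practice.

```python
import statistics


def group_relative_advantages(rewards):
    """GRPO-style normalization: score each sampled response against
    the mean and standard deviation of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


def constrained_reward(correct, reliable, lam):
    """Assumed reward: accuracy plus a multiplier-weighted reliability term."""
    return float(correct) + lam * float(reliable)


def update_multiplier(lam, batch_reliability, target, step_size=0.1):
    """Dual ascent: raise lam when reliability falls short of the target,
    lower it (never below zero) when the constraint is satisfied."""
    return max(0.0, lam + step_size * (target - batch_reliability))


# One group of four sampled responses as (correct, reliable) flags.
group = [(True, True), (False, True), (False, False), (True, False)]
lam = 0.5
rewards = [constrained_reward(c, r, lam) for c, r in group]
advantages = group_relative_advantages(rewards)

# Half of the group was reliable, below a hypothetical target of 0.75,
# so the multiplier grows and reliability weighs more in the next step.
batch_reliability = sum(r for _, r in group) / len(group)
lam = update_multiplier(lam, batch_reliability, target=0.75)
```

The design choice worth noting is that the multiplier is not a fixed hyperparameter: it rises while the reliability target is missed and decays once the target is met, so the policy is not pushed to trade accuracy for reliability beyond the threshold.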

In practice

Use multi-step reflection and verification.
Employ THINK, SEARCH, and READ actions.
Calibrate confidence scores (on a 1-10 scale) against correctness.
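
The practices above can be sketched as a small control loop. This is a toy illustration under stated assumptions, not the paper's agent: the corpus, the `search` and `read` helpers, the confidence threshold of 7, and the hard-coded confidence update are all hypothetical placeholders for what a trained model would do.

```python
from dataclasses import dataclass, field

# Toy stand-in for an external source such as Wikipedia (hypothetical data).
TOY_CORPUS = {
    "capital of australia": "Canberra is the capital city of Australia.",
}


@dataclass
class State:
    question: str
    confidence: int = 3                        # self-assessed, 1-10 scale
    answer: str = ""
    trail: list = field(default_factory=list)  # evidence trail of actions


def search(query):
    """Hypothetical SEARCH: return ids of documents matching the query."""
    return [key for key in TOY_CORPUS if key in query.lower()]


def read(doc_id):
    """Hypothetical READ: return the text of one document."""
    return TOY_CORPUS[doc_id]


def deliberate(question, threshold=7, max_steps=4):
    """Reason first; search and read only while confidence stays low."""
    state = State(question)
    for _ in range(max_steps):
        state.trail.append(("THINK", f"confidence={state.confidence}"))
        if state.confidence >= threshold:
            break  # confident enough to answer
        hits = search(question)
        state.trail.append(("SEARCH", hits))
        if not hits:
            break  # nothing to read; answer stays low-confidence
        passage = read(hits[0])
        state.trail.append(("READ", passage))
        # Placeholder for the model's own update after reading evidence.
        state.answer = passage.split()[0]
        state.confidence = 9
    return state


result = deliberate("What is the capital of Australia?")
print(result.answer, result.confidence)  # Canberra 9
print([action for action, _ in result.trail])
# ['THINK', 'SEARCH', 'READ', 'THINK']
```

The `trail` list doubles as the evidence trail: every action and what it returned is recorded, so the final confidence-annotated answer can be audited.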

Topics

Deliberative Searcher
LLM Reliability
Constrained Reinforcement Learning
Certainty Calibration
Retrieval-Augmented Generation

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.