Quantifying the Accuracy and Cost Impact of Design Decisions in Budget-Constrained Agentic LLM Search

2025-07-01 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Software Development & Engineering · Depth: Advanced, extended

Summary

A controlled measurement study introduces Budget-Constrained Agentic Search (BCAS), an evaluation harness that quantifies how design decisions affect accuracy and cost in agentic Retrieval-Augmented Generation (RAG) systems under explicit budget constraints. The study compares six Large Language Models (LLMs) across three question-answering benchmarks (TriviaQA, HotpotQA, 2WikiMultihopQA) and analyzes how search depth, retrieval strategy, and completion token budget affect performance. Three findings stand out: accuracy generally improves with up to three additional searches; hybrid lexical and dense retrieval with lightweight re-ranking provides the largest average gains; and larger completion budgets are particularly beneficial for synthesis-heavy tasks such as HotpotQA. The BCAS framework is model-agnostic, uses commodity prompts, and logs per-question resource consumption for cost analysis.

Key takeaway

AI architects designing cost-effective agentic RAG systems should allocate budget first to iterative search depth and to retrieval quality (hybrid search with re-ranking). The first question to ask is "How many searches can we afford?" rather than "How large a completion window should we buy?" This strategy captures most of the accuracy gains while bounding compute costs. It matters most for smaller models, which, given optimized search and retrieval, can match the single-search performance of larger models.

Key insights

In agentic RAG, spending budget on search depth and retrieval quality yields larger accuracy gains than expanding token limits.

Principles

Accuracy improves reliably up to ~3 searches.
Hybrid retrieval with re-ranking yields consistent gains.
Larger token budgets primarily aid multi-hop synthesis.

Method

BCAS runs each LLM in a stateful loop that gives it explicit search and token allowances, gates tool use dynamically, and tracks resource consumption, so that agentic RAG performance can be evaluated under a fixed budget.
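The loop described above can be illustrated with a minimal sketch. This is not the BCAS implementation: the names `Budget`, `run_episode`, `stub_llm`, and `stub_search`, the `(action, text, tokens)` step signature, and the stopping rule are all assumptions made for illustration. The sketch only shows the general shape of a budget-gated loop that records per-question resource use.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical step signature (an assumption, not the paper's API):
# (question, evidence, search_allowed, tokens_left) -> (action, text, tokens_used)
# where action is "search" or "answer".
LLMStep = Callable[[str, List[str], bool, int], Tuple[str, str, int]]
SearchFn = Callable[[str], List[str]]


@dataclass
class Budget:
    """Explicit allowances plus running counters for one question."""
    max_searches: int
    max_completion_tokens: int
    searches_used: int = 0
    tokens_used: int = 0

    def can_search(self) -> bool:
        return self.searches_used < self.max_searches

    def tokens_left(self) -> int:
        return max(0, self.max_completion_tokens - self.tokens_used)


def run_episode(question: str, llm_step: LLMStep, search: SearchFn,
                budget: Budget) -> dict:
    """Run one question under the budget and return a per-question log."""
    evidence: List[str] = []
    answer: Optional[str] = None
    # At most max_searches search turns plus one final answer turn.
    for _ in range(budget.max_searches + 1):
        if budget.tokens_left() == 0:
            break
        allowed = budget.can_search()  # gate the search tool for this turn
        action, text, tokens = llm_step(question, evidence, allowed,
                                        budget.tokens_left())
        budget.tokens_used += min(tokens, budget.tokens_left())
        if action == "search" and allowed:
            budget.searches_used += 1
            evidence.extend(search(text))
            continue
        if action == "answer":
            answer = text
        break  # answered, or requested a search that was gated off
    return {"answer": answer,
            "searches_used": budget.searches_used,
            "tokens_used": budget.tokens_used}


def stub_llm(question, evidence, search_allowed, tokens_left):
    """Toy model: searches while allowed, then answers. 10 tokens per turn."""
    if search_allowed:
        return ("search", question, 10)
    return ("answer", f"answer from {len(evidence)} passages", 10)


def stub_search(query):
    """Toy retriever returning a single placeholder passage."""
    return [f"passage about {query}"]
```

Calling `run_episode("Who wrote Dune?", stub_llm, stub_search, Budget(3, 256))` with these stubs performs three searches and then answers. The returned dictionary is the kind of per-question record a cost analysis could aggregate.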

In practice

Prioritize iterative search depth (up to about 3 additional searches).
Implement hybrid retrieval with re-ranking for evidence quality.
Reserve larger completion token budgets mainly for multi-hop, synthesis-heavy tasks.
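As a rough illustration of hybrid retrieval with re-ranking, the sketch below fuses a lexical ranking and a dense ranking, then re-orders the fused candidates with a scoring function. The summary does not say how the study combines the two retrievers or which re-ranker it uses. Reciprocal rank fusion (RRF) is swapped in here as one common fusion choice, and `rrf_fuse`, `rerank`, and `score_fn` are hypothetical names.

```python
from typing import Callable, Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of document ids with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    RRF is an assumption here, not necessarily the fusion used in the study.
    """
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def rerank(query: str, candidates: List[str],
           score_fn: Callable[[str, str], float], top_n: int) -> List[str]:
    """Re-order fused candidates with a (hypothetical) relevance scorer."""
    ranked = sorted(candidates, key=lambda d: score_fn(query, d), reverse=True)
    return ranked[:top_n]


# Toy rankings standing in for a lexical retriever and a dense retriever.
lexical_hits = ["a", "b", "c"]
dense_hits = ["b", "d", "a"]
fused = rrf_fuse([lexical_hits, dense_hits])
```

In a real pipeline, `score_fn` would be a lightweight relevance model applied only to the fused shortlist, which keeps re-ranking cost bounded by the shortlist size rather than the corpus size.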

Topics

Retrieval-Augmented Generation
Agentic LLMs
Budget-Constrained AI
Hybrid Retrieval
Question Answering

Best for: AI Scientist, Research Scientist, AI Architect, AI Researcher, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.