RASER: Recoverability-Aware Selective Escalation Router for Multi-Hop Question Answering
Summary
RASER (Recoverability-Aware Selective Escalation Router) is a system that cuts token costs in multi-hop question answering (QA) by escalating retrieval only when needed. Multi-hop QA pipelines often spend heavily on LLM tokens because they call the model repeatedly for question decomposition or iterative retrieval, even though many questions can be answered by a single one-shot RAG pass. RASER runs one-shot RAG first, then uses six features to decide whether to stop or escalate to a more complex retrieval action such as PRUNE or iterative retrieval (IRCoT). The decision itself requires no additional LLM calls. Across six LLMs and three multi-hop QA benchmarks, RASER achieves F1 scores competitive with state-of-the-art baselines while using only 41-49% of the tokens of an "always-prune" strategy, and fewer tokens than the other iterative or decomposition baselines.
Key takeaway
If you build multi-hop question-answering systems, consider a selective escalation router like RASER. It lowers LLM token costs by identifying the questions that one-shot RAG can already answer and escalating only the rest. The reported results show competitive F1 scores at 41-49% of the token cost of an always-prune strategy, a saving of more than 50% that goes straight to your operating budget.
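To make the arithmetic behind the "more than 50%" figure concrete, here is a minimal sketch. It only restates the reported 41-49% relative cost as a saving; the function name `token_savings` is illustrative and not from the original article.

```python
def token_savings(relative_cost: float) -> float:
    """Fraction of tokens saved when a strategy uses `relative_cost`
    of the baseline's tokens (e.g. 0.41 means 41% of always-prune)."""
    if not 0.0 <= relative_cost <= 1.0:
        raise ValueError("relative_cost must be between 0 and 1")
    return 1.0 - relative_cost


# Reported range: RASER uses 41-49% of the always-prune token budget.
for rel in (0.41, 0.49):
    print(f"{rel:.0%} of baseline -> {token_savings(rel):.0%} saved")
```

Under the reported range, the saving relative to always-prune falls between 51% and 59%.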
Key insights
RASER reduces multi-hop QA token costs by selectively escalating retrieval based on one-shot RAG features.
Principles
- Many multi-hop questions are solvable by one-shot RAG.
- Selective escalation optimizes LLM budget use.
- Explicit cost-accuracy trade-offs improve retrieval efficiency.
Method
RASER runs one-shot RAG, then uses six features to choose among three actions: stop, PRUNE, or iterative retrieval (IRCoT). The routing decision itself makes no extra LLM calls.
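The shape of such a router can be sketched as follows. This is an illustration under stated assumptions, not the authors' method: the summary does not name the six features, so they appear here as an opaque numeric vector, and the linear score, the weights, the two thresholds, and the ordering of PRUNE before IRCoT are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class Action(Enum):
    STOP = "stop"    # keep the one-shot RAG answer
    PRUNE = "prune"  # escalate to PRUNE
    IRCOT = "ircot"  # escalate to iterative retrieval (IRCoT)


@dataclass(frozen=True)
class RouterConfig:
    # Hypothetical parameters; the source gives no weights or thresholds.
    weights: Sequence[float]
    stop_threshold: float
    prune_threshold: float


def route(features: Sequence[float], cfg: RouterConfig) -> Action:
    """Pick an action from features computed on the one-shot RAG pass.

    No LLM call happens here: the decision is plain arithmetic over
    features that are already available.
    """
    if len(features) != len(cfg.weights):
        raise ValueError("features and weights must have the same length")
    # Assumed scoring rule: a weighted sum read as confidence in the
    # one-shot answer. Higher means less need to escalate.
    score = sum(w * x for w, x in zip(cfg.weights, features))
    if score >= cfg.stop_threshold:
        return Action.STOP
    if score >= cfg.prune_threshold:
        return Action.PRUNE
    return Action.IRCOT


cfg = RouterConfig(weights=[1 / 6] * 6, stop_threshold=0.7, prune_threshold=0.4)
print(route([0.9] * 6, cfg))  # high score: keep the one-shot answer
print(route([0.1] * 6, cfg))  # low score: escalate
```

The point of the design is that the router reads only signals the one-shot pass has already produced, so the cost of deciding stays negligible next to the cost of the actions it chooses between.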
In practice
- Integrate RASER to optimize LLM token usage in multi-hop QA.
- Analyze one-shot RAG performance to identify questions needing escalation.
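One way to run the analysis in the second point is an offline pass over a labeled development set: score each one-shot RAG answer against the gold answer and list the questions that fall short. The sketch below uses a simplified token-level F1 (lowercasing and whitespace splitting only, without the punctuation and article normalization that standard QA evaluation scripts apply). The record fields `id`, `one_shot_answer`, and `gold`, and the 0.5 threshold, are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List


def token_f1(prediction: str, gold: str) -> float:
    """Simplified token-level F1 between a predicted and a gold answer."""
    pred = prediction.lower().split()
    ref = gold.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def needs_escalation(records: List[Dict[str, str]], threshold: float = 0.5) -> List[str]:
    """Return ids of questions whose one-shot answer scores below threshold."""
    return [
        r["id"]
        for r in records
        if token_f1(r["one_shot_answer"], r["gold"]) < threshold
    ]


dev_set = [
    {"id": "q1", "one_shot_answer": "Paris", "gold": "paris"},
    {"id": "q2", "one_shot_answer": "Berlin", "gold": "Vienna"},
]
print(needs_escalation(dev_set))
```

This uses gold answers, so it is a diagnostic for deciding whether a router is worth building and for labeling training data, not something available at inference time.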
Topics
- Multi-hop QA
- Retrieval-Augmented Generation
- LLM Cost Optimization
- Token Efficiency
- RASER
- Iterative Retrieval
Best for: AI Engineer, Machine Learning Engineer, CTO, AI Scientist, NLP Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.