An LLM as arbiter in RAG retrieval: picking the right candidate with reasons
Summary
This article, part of the "Enterprise Document Intelligence" series, details the "arbiter" component of an enterprise RAG retrieval system. The arbiter is a single LLM call that ranks retrieval candidates with explicit reasons, replacing traditional score fusion techniques such as Reciprocal Rank Fusion (RRF). It reads a structured brief for each candidate, built from TOC, keyword, and embedding signals, then assigns each candidate a role (e.g., "primary", "discarded") and a plain-text justification for auditability. The approach prioritizes keyword and TOC-based retrieval, noting that embeddings often dilute high-signal tokens and lack structural awareness. The article reports a 23-point accuracy gap: 71% for embedding-only retrieval vs. 94% with all methods and dispatching. A "dispatcher" dynamically selects retrieval methods based on question type. The system also emphasizes a robust "not found" mechanism, which keyword methods provide more reliably than embeddings, to prevent costly wrong answers in enterprise settings. The final output is a comprehensive RetrievalResult JSON object, ready for generation and auditing.
Key takeaway
AI engineers designing enterprise RAG systems should move beyond basic embedding-only retrieval and score fusion. Implement an LLM arbiter that processes structured candidate briefs and gives explicit reasons for its ranking decisions, which makes those decisions auditable. Dispatch retrieval methods (TOC, keywords, embeddings) dynamically, based on question intent and document structure. For structured documents, prioritize keyword-based methods so that "not found" responses are reliable and costly wrong answers in compliance or legal contexts are avoided.
Key insights
An LLM arbiter, given structured candidate briefs, can rank RAG results with auditable reasons, surpassing score fusion.
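For context, the score fusion baseline that the arbiter replaces can be sketched in a few lines. This is a minimal illustration of standard Reciprocal Rank Fusion, not code from the original article; the candidate ids and the three ranked lists are invented for the example, and k=60 is the conventional default constant.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of candidate ids into (id, score) pairs, best first."""
    scores = defaultdict(float)
    for ranked_ids in rankings.values():
        for rank, cand_id in enumerate(ranked_ids, start=1):
            # Each method contributes 1 / (k + rank); the method name is not kept.
            scores[cand_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical ranked lists, one per retrieval method.
rankings = {
    "toc": ["sec-4.2", "sec-1.1"],
    "keyword": ["sec-4.2", "sec-7.3"],
    "embedding": ["sec-1.1", "sec-7.3", "sec-4.2"],
}
fused = reciprocal_rank_fusion(rankings)
for cand_id, score in fused:
    print(cand_id, round(score, 4))
```

The fused list carries a single number per candidate. Which methods found it, and which keywords matched, are no longer visible, and that is the information the arbiter's structured brief keeps.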
Principles
- Score fusion discards crucial "why" signals from individual detectors.
- Keyword-based retrieval reliably proves absence, unlike continuous embedding scores.
- In enterprise RAG, a "not found" response is superior to a confident, wrong answer.
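The second and third principles can be illustrated with a small sketch. A keyword match is binary, so an empty result is an explicit statement that no dictionary term occurs in the document, whereas an embedding search always returns a nearest neighbour with some score. The dictionary contents, function names, and return shape below are hypothetical and are not taken from the original article.

```python
import re

# Hypothetical expert dictionary: concept -> terms an expert would look for.
EXPERT_KEYWORDS = {
    "termination": ["termination", "terminate", "notice period"],
    "liability": ["liability cap", "limitation of liability"],
}

def keyword_hits(concept, sections):
    """Return {section_id: [matched terms]} for sections containing any term."""
    # A KeyError here is deliberate: absence can only be claimed for
    # concepts the dictionary actually covers.
    terms = EXPERT_KEYWORDS[concept]
    hits = {}
    for section_id, text in sections.items():
        matched = [
            term for term in terms
            if re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE)
        ]
        if matched:
            hits[section_id] = matched
    return hits

def retrieve_or_not_found(concept, sections):
    """Return candidates, or an explicit not_found result with a reason."""
    hits = keyword_hits(concept, sections)
    if not hits:
        return {
            "status": "not_found",
            "reason": f"no dictionary term for '{concept}' appears in the document",
        }
    return {"status": "found", "candidates": hits}

sections = {
    "s1": "Either party may terminate with a 30-day notice period.",
    "s2": "Payment is due within 60 days.",
}
print(retrieve_or_not_found("termination", sections))
print(retrieve_or_not_found("liability", sections))
```

The quality of the "not found" signal depends entirely on the dictionary: a term missing from the dictionary looks the same as a term missing from the document.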
Method
The LLM arbiter processes a structured brief for each candidate, detailing its methods, section, matched_keywords, and snippet. It then assigns a role and reason in a single call, producing a CandidateRanking list.
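The method above could be sketched as follows. The brief fields (methods, section, matched_keywords, snippet), the role and reason outputs, and the CandidateRanking name come from the summary; everything else is an assumption made for illustration, including the candidate_id field, the "supporting" role, the prompt wording, and the call_llm parameter, which stands in for whatever LLM client is used. A canned fake_llm replaces the real model so the sketch runs offline.

```python
import json
from dataclasses import dataclass, asdict
from typing import Callable, List

# "primary" and "discarded" appear in the summary; "supporting" is assumed.
ALLOWED_ROLES = {"primary", "supporting", "discarded"}

@dataclass
class CandidateBrief:
    candidate_id: str            # assumed identifier field
    methods: List[str]           # which detectors found this candidate
    section: str
    matched_keywords: List[str]
    snippet: str

@dataclass
class CandidateRanking:
    candidate_id: str
    role: str
    reason: str                  # plain-text justification kept for audit

def build_prompt(question: str, briefs: List[CandidateBrief]) -> str:
    """Serialize all briefs into one prompt so a single call ranks them."""
    payload = json.dumps([asdict(b) for b in briefs], indent=2)
    return (
        "You are an arbiter for document retrieval.\n"
        f"Question: {question}\n"
        f"Candidates:\n{payload}\n"
        "Return a JSON list of objects with keys candidate_id, role, reason. "
        f"Allowed roles: {sorted(ALLOWED_ROLES)}."
    )

def arbitrate(question: str, briefs: List[CandidateBrief],
              call_llm: Callable[[str], str]) -> List[CandidateRanking]:
    """Make one LLM call and validate its answer before trusting it."""
    raw = call_llm(build_prompt(question, briefs))
    known_ids = {b.candidate_id for b in briefs}
    rankings = []
    for item in json.loads(raw):
        if item["candidate_id"] not in known_ids:
            raise ValueError(f"unknown candidate: {item['candidate_id']}")
        if item["role"] not in ALLOWED_ROLES:
            raise ValueError(f"unknown role: {item['role']}")
        rankings.append(
            CandidateRanking(item["candidate_id"], item["role"], item["reason"])
        )
    return rankings

def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns a canned JSON answer."""
    return json.dumps([
        {"candidate_id": "c1", "role": "primary",
         "reason": "Found by TOC and keyword; section title matches the question."},
        {"candidate_id": "c2", "role": "discarded",
         "reason": "Embedding-only hit in a definitions section."},
    ])

briefs = [
    CandidateBrief("c1", ["toc", "keyword"], "4.2 Termination",
                   ["notice period"], "Either party may terminate with 30 days notice."),
    CandidateBrief("c2", ["embedding"], "1.1 Definitions",
                   [], "In this agreement, Party means a signatory."),
]
rankings = arbitrate("What is the notice period for termination?", briefs, fake_llm)
for r in rankings:
    print(r.candidate_id, r.role, "-", r.reason)
```

Validating ids and roles before building the ranking list is a design choice of this sketch: a malformed model answer fails loudly rather than flowing into generation.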
In practice
- Implement a dispatcher to dynamically select retrieval methods per question.
- Capture the LLM arbiter's plain-text reasons for a defensible audit trail.
- Develop expert keyword dictionaries to enable reliable "not found" detection.
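A dispatcher of the kind described in the first point could start as a handful of rules. The sketch below is a hypothetical rule set, not the article's implementation: the trigger words, the has_toc flag, and the fallback are all assumptions chosen to show the shape of the idea.

```python
import re
from typing import List

def dispatch(question: str, has_toc: bool) -> List[str]:
    """Pick retrieval methods for one question using simple, assumed rules."""
    methods = []
    q = question.lower()
    # Structural references are best served by the table of contents.
    if has_toc and re.search(r"\b(section|chapter|clause|article|appendix)\b", q):
        methods.append("toc")
    # Quoted phrases or digits suggest the user wants an exact match.
    if re.search(r'"[^"]+"|\d', question):
        methods.append("keyword")
    # Open-ended questions benefit from semantic similarity.
    if re.match(r"\s*(why|how|explain|summarize|describe)\b", q):
        methods.append("embedding")
    # Fallback when no rule fires: combine keyword and embedding search.
    if not methods:
        methods = ["keyword", "embedding"]
    return methods

print(dispatch('What does clause 12 say about "force majeure"?', has_toc=True))
print(dispatch("Explain the overall refund policy", has_toc=False))
print(dispatch("Who signs the contract?", has_toc=True))
```

Logging the returned method list next to the arbiter's reasons would let an auditor see both which detectors ran and why a candidate was chosen.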
Topics
- RAG Retrieval
- LLM Arbiter
- Enterprise AI
- Keyword Search
- Embedding Search
- Audit Trails
- Document Intelligence
Best for: AI Engineer, MLOps Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.