An LLM as arbiter in RAG retrieval: picking the right candidate with reasons

2026-06-25 · Source: Towards Data Science · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, extended

Summary

This article, part of the "Enterprise Document Intelligence" series, details the "arbiter" component of an enterprise RAG retrieval system. The arbiter is a single LLM call that ranks retrieval candidates and states an explicit reason for each decision, replacing score fusion techniques such as Reciprocal Rank Fusion (RRF). It reads a structured brief for each candidate that combines signals from the TOC, keyword, and embedding methods, then assigns a role (e.g., "primary", "discarded") and a plain-text justification for auditability.

The approach prioritizes keyword and TOC-based retrieval, noting that embeddings often dilute high-signal tokens and lack structural awareness. The article reports a 23-point accuracy gap: 71% for embedding-only retrieval vs. 94% with all methods and dispatching. A "dispatcher" selects retrieval methods dynamically based on question type. The system also emphasizes a robust "not found" mechanism, which keyword methods provide more reliably than embeddings, to prevent costly wrong answers in enterprise settings. The final output is a comprehensive RetrievalResult JSON object, ready for generation and auditing.

Key takeaway

AI engineers designing enterprise RAG systems should move beyond basic embedding-only retrieval and score fusion. Implement an LLM arbiter that processes structured candidate briefs and gives an explicit reason for each ranking decision, which improves auditability. Dispatch retrieval methods (TOC, keywords, embeddings) dynamically based on question intent and document structure. Prioritize keyword-based methods for structured documents so the system can return a reliable "not found" response instead of a costly wrong answer in compliance or legal contexts.

Key insights

An LLM arbiter, given structured candidate briefs, can rank RAG results with auditable reasons, surpassing score fusion.

Principles

Score fusion discards crucial "why" signals from individual detectors.
Keyword-based retrieval reliably proves absence, unlike continuous embedding scores.
In enterprise RAG, a "not found" response is superior to a confident, wrong answer.
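
The first two principles can be illustrated with a small sketch of Reciprocal Rank Fusion, using the commonly cited form score(d) = sum over methods of 1 / (k + rank). The candidate names, the example rankings, and the choice k = 60 are assumptions made for illustration; they are not taken from the article.

```python
def rrf(rankings: dict[str, list[str]], k: int = 60) -> dict[str, float]:
    """Fuse ranked lists: each method contributes 1 / (k + rank) per candidate."""
    scores: dict[str, float] = {}
    for ranked in rankings.values():
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return scores


# Hypothetical rankings: A is the top keyword hit, B is the top embedding hit.
rankings = {
    "keyword": ["A", "C", "B"],
    "embedding": ["B", "C", "A"],
}
fused = rrf(rankings)

# A and B end up with the same fused score, so the number alone cannot say
# which detector supported which candidate (the "why" is gone).
same_score = fused["A"] == fused["B"]

# A keyword detector can return an empty list, which is evidence of absence.
# An embedding ranking always has a top candidate, relevant or not.
no_keyword_match = rrf({"keyword": [], "embedding": ["B", "C", "A"]})
```

In the second call, the fused scores still rank B first even though the keyword method found nothing, which is the situation where a "not found" response would be the safer output.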

Method

The LLM arbiter processes a structured brief for each candidate, detailing its methods, section, matched_keywords, and snippet. It then assigns a role and reason in a single call, producing a CandidateRanking list.
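
A minimal sketch of this method is shown here, assuming JSON is used for both the brief and the arbiter's reply. The field names methods, section, matched_keywords, and snippet, and the CandidateRanking name, come from the summary; candidate_id, the "supporting" role, the prompt wording, and the call_llm callable are hypothetical and stand in for whatever the original system uses.

```python
import json
from dataclasses import dataclass, asdict
from typing import Callable

# "primary" and "discarded" appear in the summary; "supporting" is an assumption.
ALLOWED_ROLES = {"primary", "supporting", "discarded"}


@dataclass
class CandidateBrief:
    candidate_id: str            # hypothetical identifier for this sketch
    methods: list[str]           # detectors that surfaced it, e.g. "toc", "keyword"
    section: str
    matched_keywords: list[str]
    snippet: str


@dataclass
class CandidateRanking:
    candidate_id: str
    role: str
    reason: str                  # plain-text justification kept for the audit trail


def build_prompt(question: str, briefs: list[CandidateBrief]) -> str:
    """Serialize all briefs into one prompt so the arbiter needs a single call."""
    payload = json.dumps([asdict(b) for b in briefs], indent=2)
    return (
        f"Question: {question}\n"
        f"Candidates (JSON):\n{payload}\n"
        "Return a JSON list of objects with keys candidate_id, role, reason. "
        f"Allowed roles: {', '.join(sorted(ALLOWED_ROLES))}."
    )


def parse_rankings(raw: str, briefs: list[CandidateBrief]) -> list[CandidateRanking]:
    """Validate the reply: reject unknown candidates and unknown roles."""
    known = {b.candidate_id for b in briefs}
    rankings = []
    for item in json.loads(raw):
        if item["candidate_id"] not in known:
            raise ValueError(f"unknown candidate: {item['candidate_id']}")
        if item["role"] not in ALLOWED_ROLES:
            raise ValueError(f"unknown role: {item['role']}")
        rankings.append(
            CandidateRanking(item["candidate_id"], item["role"], item["reason"])
        )
    return rankings


def arbitrate(
    question: str,
    briefs: list[CandidateBrief],
    call_llm: Callable[[str], str],  # hypothetical: takes a prompt, returns text
) -> list[CandidateRanking]:
    return parse_rankings(call_llm(build_prompt(question, briefs)), briefs)
```

Passing the model call in as a function keeps the sketch independent of any particular LLM client and makes the parsing step testable with a stub.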

In practice

Implement a dispatcher to dynamically select retrieval methods per question.
Capture the LLM arbiter's plain-text reasons for a defensible audit trail.
Develop expert keyword dictionaries to enable reliable "not found" detection.
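
The dispatcher and the keyword-based "not found" check might look roughly like the sketch below. The regex rules, the KEYWORDS dictionary, the page texts, and all function names are hypothetical; the summary does not describe how the original dispatcher or dictionaries are built.

```python
import re

# Hypothetical expert dictionary: topic -> terms a domain expert would expect.
KEYWORDS = {
    "termination": ["termination", "terminate", "notice period"],
}


def dispatch(question: str, has_toc: bool) -> list[str]:
    """Pick retrieval methods for one question using simple illustrative rules."""
    methods = ["keyword"]  # always run keywords so absence can be detected
    if has_toc and re.search(r"\b(section|chapter|clause|article|annex)\b", question, re.I):
        methods.append("toc")
    if re.search(r"\b(why|how|explain|summari[sz]e|compare)\b", question, re.I):
        methods.append("embedding")
    return methods


def keyword_hits(topic: str, pages: dict[int, str]) -> list[tuple[int, list[str]]]:
    """Return (page number, matched terms) for every page with at least one match."""
    terms = KEYWORDS[topic]
    hits = []
    for page_no, text in pages.items():
        lowered = text.lower()
        matched = [t for t in terms if t in lowered]
        if matched:
            hits.append((page_no, matched))
    return hits


def answer_or_not_found(topic: str, pages: dict[int, str]) -> dict:
    """No matching term on any page means an explicit "not found" result."""
    hits = keyword_hits(topic, pages)
    if not hits:
        return {"status": "not_found", "topic": topic}
    return {"status": "found", "hits": hits}
```

Because a keyword match is discrete, an empty hit list is a clear signal, whereas a similarity score always returns some nearest neighbor and needs a threshold to be read as absence.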

Topics

RAG Retrieval
LLM Arbiter
Enterprise AI
Keyword Search
Embedding Search
Audit Trails
Document Intelligence

Best for: AI Engineer, MLOps Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.