MADRAG: Multi-Agent Debate with Retrieval-Augmented Generation for Training-Free Analytic Essay Scoring
Summary
MADRAG is a training-free framework for analytic essay scoring that combines multi-agent reasoning with retrieval-augmented grounding. It targets the bias and score instability common in standard LLM-as-judge approaches by decomposing evaluation into an interactive process among three roles: an Advocate, a Skeptic, and a Judge. The Judge is augmented with rubric-aligned exemplar retrieval, so it can calibrate its scores by comparison with previously scored examples. MADRAG significantly outperforms prompt-based baselines and approaches the performance of supervised systems without task-specific training. Ablation studies confirm that retrieval improves calibration, while debate improves reasoning on higher-level traits, which highlights the complementary roles of structured interaction and external memory in reliable LLM-based evaluation.
Key takeaway
For AI Scientists and Machine Learning Engineers building robust, training-free LLM evaluation systems, MADRAG offers a compelling approach. Consider combining multi-agent debate with retrieval augmentation to make LLM-as-judge scoring more stable and less biased. In the reported ablations, debate improves reasoning on higher-level traits and retrieval drives the calibration gains, and the combined system approaches supervised performance without task-specific training.
Key insights
MADRAG employs multi-agent debate and retrieval-augmented generation for training-free, reliable analytic essay scoring, surpassing prompt-based LLM baselines.
Principles
- Decompose complex evaluation into interactive agent roles.
- Augment LLM judges with rubric-aligned exemplar retrieval.
- Combine structured interaction with external memory to improve LLM reliability.
Method
MADRAG employs an Advocate to identify strengths, a Skeptic to critique weaknesses, and a Judge to aggregate their arguments into a score. The Judge is calibrated by retrieving rubric-aligned, previously scored exemplars and comparing the essay against them.
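The role structure described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `llm` callable, the `Exemplar` structure, the prompt wording, and the `rounds` parameter are all assumptions introduced here, since the summary does not specify them.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt to a completion.
LLM = Callable[[str], str]


@dataclass
class Exemplar:
    """A previously scored essay for one rubric trait (illustrative structure)."""
    essay: str
    trait: str
    score: int


def run_debate(essay: str, trait: str, exemplars: List[Exemplar],
               llm: LLM, rounds: int = 1) -> str:
    """Run Advocate/Skeptic turns, then return the Judge's verdict text."""
    transcript: List[str] = []
    for _ in range(rounds):
        # Advocate argues for the essay's strengths on this trait.
        advocate = llm(
            f"ROLE: Advocate\nTRAIT: {trait}\nESSAY: {essay}\n"
            "DEBATE SO FAR:\n" + "\n".join(transcript)
        )
        transcript.append(f"Advocate: {advocate}")
        # Skeptic sees the Advocate's turn and critiques weaknesses.
        skeptic = llm(
            f"ROLE: Skeptic\nTRAIT: {trait}\nESSAY: {essay}\n"
            "DEBATE SO FAR:\n" + "\n".join(transcript)
        )
        transcript.append(f"Skeptic: {skeptic}")

    # Only exemplars scored on the same trait are shown to the Judge.
    references = "\n".join(
        f"[score {e.score}] {e.essay}" for e in exemplars if e.trait == trait
    )
    return llm(
        f"ROLE: Judge\nTRAIT: {trait}\nESSAY: {essay}\n"
        f"SCORED EXEMPLARS:\n{references}\n"
        "DEBATE:\n" + "\n".join(transcript)
    )
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM client; in practice the Judge's reply would still need to be parsed into a numeric trait score.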
In practice
- Implement multi-agent systems for complex evaluation tasks.
- Integrate retrieval-augmented generation for LLM calibration.
- Use debate-like structures to improve LLM reasoning.
Topics
- Multi-Agent Systems
- Retrieval-Augmented Generation
- Analytic Essay Scoring
- Large Language Models
- Training-Free Evaluation
- LLM Evaluation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.