MADRAG: Multi-Agent Debate with Retrieval-Augmented Generation for Training-Free Analytic Essay Scoring

Source: Computation and Language · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Natural Language Processing) · Depth: Expert, quick

Summary

MADRAG is a training-free framework for analytic essay scoring that combines multi-agent reasoning with retrieval-augmented grounding. Conventional LLM-as-judge methods often produce biased and unstable scores; MADRAG instead structures evaluation as an interactive debate. An Advocate identifies the essay's strengths, a Skeptic critiques its weaknesses, and a Judge synthesizes both arguments into a final score. The Judge is additionally given retrieved, rubric-aligned exemplars, so it can calibrate its score by comparing the essay with pre-scored examples. Experimental results indicate that MADRAG substantially outperforms prompt-based baselines and achieves performance comparable to supervised systems without task-specific training. Ablation studies indicate that retrieval improves calibration, while the multi-agent debate improves reasoning on higher-level essay traits.

Key takeaway

For machine learning engineers building robust, training-free evaluation systems, MADRAG offers an alternative to single-prompt LLM-as-judge methods. Consider combining a multi-agent debate architecture with retrieval-augmented generation to improve scoring accuracy and calibration. In the reported experiments, this approach substantially outperforms prompt-based baselines and requires no task-specific training, while improving reasoning on higher-level essay traits.

Key insights

Multi-agent debate combined with retrieval augmentation offers a training-free approach to analytic essay scoring that is better calibrated than, and outperforms, prompt-based LLM baselines.

Method

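MADRAG employs three agents: an Advocate, a Skeptic, and a Judge. The Advocate argues for the essay's strengths, the Skeptic critiques its weaknesses, and the Judge aggregates both arguments into a final score, calibrated against retrieved, rubric-aligned exemplars that have already been scored.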
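
The three-role flow can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `LLM` callable, the `Exemplar` record, the prompt wording, the single debate round, the Skeptic seeing the Advocate's argument, the bag-of-words cosine retriever, and the `SCORE: <integer>` reply convention are all hypothetical choices made for the example. The summary does not specify the retriever, the prompts, or the number of debate turns.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: any function mapping a prompt to a completion.
LLM = Callable[[str], str]


@dataclass
class Exemplar:
    """A pre-scored essay for one rubric trait (hypothetical record type)."""
    text: str
    trait: str
    score: int


def _bow(text: str) -> Counter:
    # Lowercased bag-of-words counts; a stand-in for a real embedding model.
    return Counter(re.findall(r"[a-z']+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_exemplars(
    essay: str, trait: str, pool: List[Exemplar], k: int = 2
) -> List[Exemplar]:
    """Return the k exemplars for this trait that are most similar to the essay."""
    query = _bow(essay)
    candidates = [e for e in pool if e.trait == trait]
    ranked = sorted(
        candidates, key=lambda e: _cosine(query, _bow(e.text)), reverse=True
    )
    return ranked[:k]


def score_trait(
    essay: str, trait: str, rubric: str, pool: List[Exemplar], llm: LLM, k: int = 2
) -> int:
    """Run one Advocate -> Skeptic -> Judge round and return the Judge's score."""
    context = f"Trait: {trait}\nRubric: {rubric}\nEssay: {essay}\n"
    advocate = llm(
        "ROLE: Advocate\n" + context + "Argue for the essay's strengths on this trait."
    )
    # Assumption: the Skeptic is shown the Advocate's argument before replying.
    skeptic = llm(
        "ROLE: Skeptic\n" + context
        + f"Advocate's argument: {advocate}\n"
        + "Critique the essay's weaknesses on this trait."
    )
    # Only the Judge receives retrieved exemplars, to calibrate the final score.
    exemplars = retrieve_exemplars(essay, trait, pool, k)
    references = "\n".join(f"[score {e.score}] {e.text}" for e in exemplars)
    verdict = llm(
        "ROLE: Judge\n" + context
        + f"Advocate: {advocate}\nSkeptic: {skeptic}\n"
        + f"Pre-scored exemplars:\n{references}\n"
        + "Weigh both arguments against the exemplars. "
        + "Reply with 'SCORE: <integer>'."
    )
    match = re.search(r"SCORE:\s*(-?\d+)", verdict)
    if match is None:
        raise ValueError("Judge reply did not contain a score")
    return int(match.group(1))
```

To try the sketch, pass any function that wraps a model call as `llm`, plus a small pool of `Exemplar` records for the trait being scored. Keeping retrieval on the Judge side only mirrors the description above, where exemplars serve calibration rather than argument generation.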

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.