When Helping Hurts and How to Fix It: Multi-Agent Debate for Data Cleaning

Source: Artificial Intelligence · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics) · Depth: Expert, quick

Summary

A study spanning three benchmarks, four model families, and over 6,000 task-condition pairs investigates how multi-agent debate affects data cleaning. Debate degrades generative tasks, lowering performance by 1.6 to 15.5 percentage points (pp) across all four models. The cause is "critique-induced confusion" (CIC), in which generators accept hallucinated Critic feedback. Conversely, debate significantly improves error detection, raising F1 scores by 27.4pp (d=1.0). The research derives a debate benefit condition: debate helps when the probability of rescuing a wrong output outweighs the probability of destroying a correct one. This benefit condition accurately predicts the outcome for all nine task types and generalizes across 19 published comparisons in seven domains. A factorial experiment demonstrates that adversarial separation, meaning a separate Critic with code-execution grounding and evidence-gated generation, is crucial. This configuration is the first to significantly surpass single-agent performance on a generative task, achieving a +5.3pp improvement (p<0.05).
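The debate benefit condition can be made concrete with a small calculation. The sketch below is one plausible reading, not the study's own formulation: it assumes the two probabilities are weighted by how often the single agent is already right, so the expected change in accuracy is (1 - base accuracy) × P(rescue) minus base accuracy × P(destroy). The function names and the example numbers are hypothetical.

```python
def debate_gain(base_accuracy: float, p_rescue: float, p_destroy: float) -> float:
    """Expected change in accuracy from adding a debate round.

    Assumed model (not taken from the study):
      - base_accuracy: probability the single agent is correct before debate
      - p_rescue: probability debate turns a wrong output into a correct one
      - p_destroy: probability debate turns a correct output into a wrong one
    """
    for name, value in (
        ("base_accuracy", base_accuracy),
        ("p_rescue", p_rescue),
        ("p_destroy", p_destroy),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")

    rescued = (1.0 - base_accuracy) * p_rescue   # wrong outputs that debate fixes
    destroyed = base_accuracy * p_destroy        # correct outputs that debate breaks
    return rescued - destroyed


def debate_helps(base_accuracy: float, p_rescue: float, p_destroy: float) -> bool:
    """True when the expected gain from debate is positive."""
    return debate_gain(base_accuracy, p_rescue, p_destroy) > 0.0


# Hypothetical numbers, chosen only to illustrate the two regimes.
# A strong generator that a noisy Critic can talk out of correct answers:
print(debate_helps(base_accuracy=0.8, p_rescue=0.2, p_destroy=0.1))
# A weak detector with many errors left to rescue:
print(debate_helps(base_accuracy=0.4, p_rescue=0.5, p_destroy=0.1))
```

Under this reading, a task where the single agent is already mostly right has little to gain and much to lose, which is consistent with the reported degradation on generative tasks and the improvement on error detection.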

Key takeaway

For Machine Learning Engineers designing multi-agent systems for data cleaning: naive debate configurations can degrade generative task performance by 1.6 to 15.5pp. To achieve gains, implement adversarial separation with a dedicated Critic agent that uses code-execution grounding and evidence-gated generation. This approach mitigates "critique-induced confusion" by making feedback reliable, and it yielded a significant +5.3pp improvement on a generative task.

Key insights

Multi-agent debate degrades generative data cleaning through "critique-induced confusion" but improves error detection; gains on generative tasks require adversarial separation.

Principles

Method

Employ a separate Critic agent with code-execution grounding and evidence-gated generation to prevent "critique-induced confusion" and achieve generative task improvements in multi-agent debate.
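A minimal sketch of what evidence gating could look like follows. It assumes the Critic attaches an executable check to each critique and that the generator revises only when running that check shows the current output failing and the proposed fix passing. The `Critique` class, `evidence_gated_revision`, and `is_iso_date` are hypothetical names for illustration, not the study's implementation.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Critique:
    """A Critic's objection, paired with code that can confirm or refute it."""
    claim: str
    check: Callable[[str], bool]  # returns True when a value passes the check
    proposed_fix: str


def is_iso_date(value: str) -> bool:
    """Example check: does the value look like YYYY-MM-DD?"""
    return re.fullmatch(r"\d{4}-\d{2}-\d{2}", value) is not None


def evidence_gated_revision(output: str, critique: Critique) -> str:
    """Accept the Critic's fix only when executed evidence supports it."""
    try:
        current_passes = critique.check(output)
        fix_passes = critique.check(critique.proposed_fix)
    except Exception:
        # A check that crashes provides no evidence, so keep the original output.
        return output

    if not current_passes and fix_passes:
        return critique.proposed_fix
    # Either the output was already fine or the fix does not help: ignore the critique.
    return output


# A hallucinated critique is rejected because the check passes on the current output.
hallucinated = Critique("Date should use slashes", is_iso_date, "03/05/2021")
print(evidence_gated_revision("2021-03-05", hallucinated))

# A grounded critique is accepted because the check fails on the current output.
grounded = Critique("Date is not in ISO format", is_iso_date, "2021-03-05")
print(evidence_gated_revision("March 5 2021", grounded))
```

The design choice here is that the generator never weighs the Critic's prose on its own merits; only the result of executing the check can trigger a revision, which is one way to keep a hallucinated critique from overwriting a correct output.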

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.