When Helping Hurts and How to Fix It: Multi-Agent Debate for Data Cleaning
Summary
A study across three benchmarks, four model families, and over 6,000 task-condition pairs investigates the impact of multi-agent debate on data cleaning. It finds that debate degrades generative tasks, lowering performance by 1.6 to 15.5 percentage points (pp) across all four models. The cause is "critique-induced confusion" (CIC), in which generators accept hallucinated Critic feedback. Conversely, debate significantly improves error detection, raising F1 by +27.4pp (d=1.0). The research derives a debate benefit condition: debate helps when the probability of rescuing a wrong output outweighs the probability of destroying a correct one. A factorial experiment demonstrates that adversarial separation, involving a separate Critic with code-execution grounding and evidence-gated generation, is crucial. This configuration is the first to significantly surpass single-agent performance on a generative task, achieving a +5.3pp improvement (p<0.05). The debate benefit condition correctly predicts the outcome for all nine task types and generalizes across 19 published comparisons in seven domains.
Key takeaway
Machine learning engineers designing multi-agent systems for data cleaning should recognize that naive debate configurations can degrade generative task performance by 1.6 to 15.5pp. To achieve positive gains, implement adversarial separation: a dedicated Critic agent that uses code-execution grounding, combined with evidence-gated generation. This approach can yield significant improvements, such as +5.3pp on generative tasks, by mitigating "critique-induced confusion" and ensuring reliable feedback.
Key insights
Multi-agent debate degrades generative data cleaning through "critique-induced confusion" but improves error detection; gains on generative tasks require adversarial separation.
Principles
- Debate can degrade generation but improve error detection.
- Benefit requires P(rescue wrong) > P(destroy correct).
- Adversarial separation is crucial for debate efficacy.
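The benefit condition above can be made concrete with a little arithmetic. The sketch below is one possible reading, not the paper's own formulation: it assumes the two probabilities are joint probabilities, obtained by weighting a per-output rescue rate and destroy rate by the single-agent baseline accuracy. The function names and the example numbers are hypothetical and are not taken from the study.

```python
def debate_net_gain(baseline_accuracy: float,
                    rescue_rate: float,
                    destroy_rate: float) -> float:
    """Expected accuracy change from adding debate (assumed model, see lead-in).

    baseline_accuracy: fraction of single-agent outputs that are already correct.
    rescue_rate: chance that debate fixes an output that was wrong.
    destroy_rate: chance that debate breaks an output that was correct.
    """
    for value in (baseline_accuracy, rescue_rate, destroy_rate):
        if not 0.0 <= value <= 1.0:
            raise ValueError("all inputs must be probabilities in [0, 1]")
    # P(rescue wrong): the output was wrong AND debate repaired it.
    p_rescue_wrong = (1.0 - baseline_accuracy) * rescue_rate
    # P(destroy correct): the output was correct AND debate damaged it.
    p_destroy_correct = baseline_accuracy * destroy_rate
    return p_rescue_wrong - p_destroy_correct


def debate_helps(baseline_accuracy: float,
                 rescue_rate: float,
                 destroy_rate: float) -> bool:
    """True when P(rescue wrong) > P(destroy correct) under the assumed model."""
    return debate_net_gain(baseline_accuracy, rescue_rate, destroy_rate) > 0.0


# Illustrative numbers only (not results from the study).
strong_generator = debate_net_gain(0.9, rescue_rate=0.3, destroy_rate=0.1)
weak_detector = debate_net_gain(0.5, rescue_rate=0.3, destroy_rate=0.1)
print(f"strong generator: {strong_generator:+.2f}")
print(f"weak detector:    {weak_detector:+.2f}")
```

Under this reading, identical rescue and destroy rates can hurt a task where the single agent is already mostly right and help a task where it is often wrong, because a high baseline leaves few wrong outputs to rescue and many correct ones to damage.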
Method
Employ a separate Critic agent with code-execution grounding and evidence-gated generation to prevent "critique-induced confusion" and achieve generative task improvements in multi-agent debate.
In practice
- Implement separate Critic and Generator agents.
- Ground Critic feedback with code execution.
- Gate Generator output based on evidence.
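The three steps above might be wired together roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Critique` class, the `evidence_gated_revise` function, and the date-cleaning example are all hypothetical, and the Critic and Generator are stubbed as plain Python callables rather than model calls. The idea shown is that each Critic claim carries an executable check, and the Generator's output is only revised when running that check confirms the claimed defect.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Critique:
    """A Critic claim bundled with executable evidence (hypothetical structure)."""
    claim: str
    check: Callable[[str], bool]        # returns True if the defect is really present
    proposed_fix: Callable[[str], str]  # revision to apply if the check confirms it


def evidence_gated_revise(output: str,
                          critiques: list[Critique]) -> tuple[str, list[str]]:
    """Apply only those critiques whose check passes when executed on the output."""
    accepted = []
    for critique in critiques:
        try:
            confirmed = critique.check(output)
        except Exception:
            # A check that crashes provides no evidence, so the critique is rejected.
            confirmed = False
        if confirmed:
            output = critique.proposed_fix(output)
            accepted.append(critique.claim)
    return output, accepted


ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

# Generator draft: a cleaned date value with a stray trailing space.
draft = "2021-03-05 "
critiques = [
    # Hallucinated critique: the date already is in ISO format, so the check fails.
    Critique("date is not in ISO format",
             check=lambda s: ISO_DATE.fullmatch(s.strip()) is None,
             proposed_fix=lambda s: "05/03/2021"),
    # Grounded critique: the check confirms the trailing whitespace.
    Critique("trailing whitespace",
             check=lambda s: s != s.strip(),
             proposed_fix=str.strip),
]

final, accepted = evidence_gated_revise(draft, critiques)
print(repr(final), accepted)
```

In this toy run the hallucinated critique is discarded because its check does not confirm the defect, so the already correct date is not destroyed, while the grounded critique is applied. A real system would also need to sandbox any Critic-generated code before executing it.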
Topics
- Multi-agent Systems
- Data Cleaning
- Generative AI
- Error Detection
- Critique-Induced Confusion
- Adversarial Separation
Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, Data Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.