Multi-Turn Evaluation of Deep Research Agents Under Process-Level Feedback
Summary
A multi-turn evaluation of Deep Research Agents (DRAs) investigates their ability to improve reports with feedback, moving beyond single-shot output assessments. Researchers conducted tests under self-reflection and process-level feedback, designing Research Gap Inference (RGI) to infer research-process gaps from rubric criteria. Findings published on 2026-06-08 reveal that self-reflection yields negligible net improvement, with agents incorporating and regressing on rubric criteria at similar rates. Conversely, a single round of process-level feedback provides substantial gains, increasing normalized scores by approximately 8-15 points and achieving a 35-40% incorporation rate. However, these gains do not compound; subsequent turns show agents regressing on up to 24% of previously satisfied criteria. This indicates that reliable multi-turn improvement remains elusive for current DRA architectures. Code and results are publicly available.
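The incorporation and regression figures above can be made concrete with a small sketch. It assumes each turn's report is scored as a mapping from rubric criterion to satisfied or not, that the incorporation rate is the share of previously unsatisfied criteria that become satisfied, and that the regression rate is the share of previously satisfied criteria that are lost. These definitions, the function name `turn_rates`, and the sample criteria are illustrative assumptions, not the paper's published implementation.

```python
from typing import Dict, Tuple


def turn_rates(prev: Dict[str, bool], curr: Dict[str, bool]) -> Tuple[float, float]:
    """Return (incorporation_rate, regression_rate) between two turns.

    Assumed definitions (not taken from the paper's code):
      incorporation = newly satisfied / previously unsatisfied
      regression    = newly unsatisfied / previously satisfied
    """
    if prev.keys() != curr.keys():
        raise ValueError("both turns must be scored on the same rubric criteria")

    unmet = [c for c, ok in prev.items() if not ok]
    met = [c for c, ok in prev.items() if ok]

    incorporated = sum(1 for c in unmet if curr[c])
    regressed = sum(1 for c in met if not curr[c])

    # Guard against empty denominators when a turn has nothing to gain or lose.
    inc_rate = incorporated / len(unmet) if unmet else 0.0
    reg_rate = regressed / len(met) if met else 0.0
    return inc_rate, reg_rate


# Hypothetical rubric scores for two consecutive turns.
turn1 = {"c1": True, "c2": True, "c3": False, "c4": False, "c5": True, "c6": True}
turn2 = {"c1": True, "c2": False, "c3": True, "c4": False, "c5": True, "c6": True}

inc, reg = turn_rates(turn1, turn2)
print(f"incorporation={inc:.2f} regression={reg:.2f}")
```

Reporting the two rates separately matters here: a turn that gains one criterion and loses another shows no net change in score, which is the pattern the summary attributes to self-reflection.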
Key takeaway
Machine Learning Engineers developing Deep Research Agents should expect an initial round of process-level feedback to improve report quality, but should not expect current architectures to compound those gains over further turns. Prioritize single-round, targeted feedback mechanisms and design systems that minimize regression on previously satisfied criteria. Avoid complex multi-turn feedback loops in production until agents demonstrate robust, non-regressive improvement.
Key insights
Deep Research Agents show initial gains from process-level feedback but struggle with sustained multi-turn improvement due to regression.
Principles
- Self-reflection alone offers negligible agent improvement.
- Targeted process-level feedback significantly boosts agent performance.
- Multi-turn feedback can lead to regression on prior improvements.
Method
Research Gap Inference (RGI) analyzes rubric criteria satisfaction patterns to infer research-process gaps for Deep Research Agents.
In practice
- Implement process-level feedback for initial DRA gains.
- Design feedback to minimize regression on prior work.
- Evaluate DRAs over multiple feedback turns, tracking incorporation and regression per turn rather than single-shot scores alone.
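One way to act on the regression point is a simple guard that compares rubric scores before and after a revision and keeps the earlier report when too many previously satisfied criteria are lost. The sketch below is a hypothetical design, not something described in the source: the function names, the `max_regressions` threshold, and the sample criteria are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def regressed_criteria(prev: Dict[str, bool], curr: Dict[str, bool]) -> List[str]:
    """List criteria that were satisfied before the revision but are not now."""
    # A criterion missing from the new scores is treated as not satisfied.
    return sorted(c for c, ok in prev.items() if ok and not curr.get(c, False))


def select_report(
    prev_report: str,
    prev_scores: Dict[str, bool],
    new_report: str,
    new_scores: Dict[str, bool],
    max_regressions: int = 0,  # hypothetical tolerance, tune per use case
) -> Tuple[str, Dict[str, bool], List[str]]:
    """Keep the revision only if it loses at most `max_regressions` criteria."""
    lost = regressed_criteria(prev_scores, new_scores)
    if len(lost) > max_regressions:
        return prev_report, prev_scores, lost
    return new_report, new_scores, lost


# Hypothetical example: the revision gains one criterion but drops another.
before = {"cites_sources": True, "covers_counterevidence": False}
after = {"cites_sources": False, "covers_counterevidence": True}

report, scores, lost = select_report("draft v1", before, "draft v2", after)
print(report, lost)
```

The returned `lost` list can also be passed back to the agent as targeted feedback, so the next revision is asked to restore specific criteria rather than rewrite the report wholesale.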
Topics
- Deep Research Agents
- Multi-turn Evaluation
- Process-Level Feedback
- Research Gap Inference
- Agent Performance
- AI Evaluation Benchmarks
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.