Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

Deep-research agents solve complex tasks through long trajectories of search, tool use, and answer synthesis. Evaluating only the final answer cannot show *where* in a trajectory an error occurs. This research introduces span-level error localization and analyzes 2,790 real trajectories across two agent frameworks, three backbone models, and three benchmarks. Raw logs were converted into semantic spans and annotated for harmful errors through LLM-assisted expert review. The result is TELBench, a 1,000-instance benchmark for identifying error spans. The study also proposes DRIFT, a claim-centric auditing framework that tracks agent claims, verifies their support in trajectory evidence, and marks problematic spans. DRIFT improves span-level error localization and first-error accuracy by up to 30 percentage points, giving a process-level view of agent reliability.

Key takeaway

For MLOps engineers deploying deep-research agents, knowing *where* in a trajectory an agent fails matters as much as knowing that it failed. Final-answer evaluation alone cannot tell you this. Span-level error localization techniques such as DRIFT pinpoint the specific spans where errors occur, which makes debugging more targeted. In the study, DRIFT improved span-level error localization and first-error accuracy by up to 30 percentage points; that figure measures how well errors are located, not how much more reliable the agent becomes.

Key insights

Span-level error localization provides a process-level view of deep-research agent reliability, identifying specific failure points in trajectories.
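The summary names two measures, span-level error localization and first-error accuracy, without defining them. The sketch below shows one plausible way to score a predicted set of error spans against annotated ones. The function names and the exact definitions (set-overlap F1, earliest-span match) are assumptions for illustration, not the definitions used in TELBench.

```python
def first_error_correct(predicted: set[int], gold: set[int]) -> bool:
    """Assumed definition: the earliest predicted error span must equal
    the earliest annotated error span."""
    if not predicted or not gold:
        # If either side is empty, count it correct only when both are empty.
        return predicted == gold
    return min(predicted) == min(gold)


def span_f1(predicted: set[int], gold: set[int]) -> float:
    """Assumed definition: F1 over the overlap of predicted and annotated
    error-span indices."""
    if not predicted and not gold:
        return 1.0
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical trajectory: annotators marked spans 3 and 7 as harmful errors,
# and the localizer flagged spans 3 and 5.
gold_spans = {3, 7}
predicted_spans = {3, 5}
first_ok = first_error_correct(predicted_spans, gold_spans)
f1 = span_f1(predicted_spans, gold_spans)
```

Here the localizer finds the first error (span 3) but only half of its flags and half of the annotated errors match, so a final-answer score alone would hide both facts.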

Method

DRIFT is a claim-centric auditing framework that tracks agent claims, verifies their support in trajectory evidence, and marks spans where unsupported or conflicting claims affect the answer path.
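The description above can be illustrated with a minimal sketch of claim-centric auditing. This is not DRIFT's implementation: the `Span` schema, the `audit` function, and the use of exact string matching to decide whether a claim is supported are all assumptions made for illustration. A real auditor would need a stronger verifier, and this sketch covers only unsupported claims, not conflicting ones.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One semantic step of a trajectory (hypothetical schema)."""
    index: int
    kind: str  # e.g. "search", "tool", "synthesis"
    evidence: set[str] = field(default_factory=set)  # facts observed in this span
    claims: set[str] = field(default_factory=set)    # facts the agent asserts here


def audit(trajectory: list[Span], answer_claims: set[str]) -> list[int]:
    """Return indices of spans that assert an answer-relevant claim
    with no supporting evidence seen so far in the trajectory."""
    seen_evidence: set[str] = set()
    flagged: list[int] = []
    for span in trajectory:
        # Evidence gathered in this span is available to its own claims.
        seen_evidence |= span.evidence
        unsupported = {c for c in span.claims if c not in seen_evidence}
        # Only flag spans whose unsupported claims feed the final answer.
        if unsupported & answer_claims:
            flagged.append(span.index)
    return flagged


trajectory = [
    Span(0, "search", evidence={"paper published 2019"}),
    Span(1, "synthesis", claims={"paper published 2019"}),
    Span(2, "synthesis", claims={"author is J. Doe"}),     # never observed
    Span(3, "synthesis", claims={"venue is a workshop"}),  # unsupported, not in answer
]
flagged = audit(trajectory, answer_claims={"paper published 2019", "author is J. Doe"})
print(flagged)  # [2]
```

Span 3 also makes an unsupported claim, but it is not flagged because that claim never reaches the answer, which mirrors the description's focus on claims that affect the answer path.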

Best for: Research Scientist, AI Architect, AI Engineer, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.