Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories

2026-06-01 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

Deep-research agents solve complex tasks via long trajectories involving search, tool use, and answer synthesis. Evaluating only final answers fails to pinpoint *where* errors occur within these trajectories. This research introduces span-level error localization, analyzing 2,790 real trajectories across two agent frameworks, three backbone models, and three benchmarks. Raw logs were converted to semantic spans and annotated for harmful errors using LLM-assisted expert review. This process created TELBench, a 1,000-instance benchmark for identifying error spans. The study also proposes DRIFT, a claim-centric auditing framework that tracks agent claims, verifies their support in trajectory evidence, and marks problematic spans. DRIFT improves span-level error localization and first-error accuracy by up to 30 percentage points, providing a crucial process-level view of agent reliability.

Key takeaway

For MLOps engineers deploying deep-research agents, knowing *where* in a trajectory an agent fails is critical for building robust systems. Integrating span-level error localization techniques such as DRIFT moves evaluation beyond final answers and lets you pinpoint the specific spans where errors occur, which supports faster debugging. In the study, DRIFT improved span-level error localization and first-error accuracy by up to 30 percentage points.

Key insights

Span-level error localization provides a process-level view of deep-research agent reliability, identifying specific failure points in trajectories.

Principles

Evaluate agent reliability at the span-level.
Track agent claims against trajectory evidence.
Identify unsupported or conflicting claims.

Method

DRIFT is a claim-centric auditing framework that tracks agent claims, verifies their support in trajectory evidence, and marks spans where unsupported or conflicting claims affect the answer path.
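The claim-tracking idea can be illustrated with a minimal Python sketch. This is not DRIFT's actual implementation: the `Span` structure, the span kinds, and the token-overlap support check are all hypothetical stand-ins (a real auditor would more plausibly use an LLM or entailment model as the judge), and the sketch only covers unsupported claims, not conflicting ones.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One semantic span of a trajectory (hypothetical structure)."""
    index: int
    kind: str   # assumed labels: "search", "tool", "synthesis"
    text: str
    claims: list[str] = field(default_factory=list)  # claims asserted in this span


def tokens(text: str) -> set[str]:
    # Crude content-word extraction: lowercase words longer than 3 characters.
    return {w.strip(".,;:!?()").lower() for w in text.split() if len(w) > 3}


def is_supported(claim: str, evidence: list[str], threshold: float = 0.6) -> bool:
    # Placeholder support check: a claim counts as supported when enough of
    # its tokens appear in a single earlier evidence text.
    claim_tokens = tokens(claim)
    if not claim_tokens:
        return True
    return any(
        len(claim_tokens & tokens(e)) / len(claim_tokens) >= threshold
        for e in evidence
    )


def audit(trajectory: list[Span]) -> list[int]:
    """Return indices of spans that assert a claim lacking earlier evidence."""
    flagged: list[int] = []
    evidence: list[str] = []
    for span in trajectory:
        if any(not is_supported(c, evidence) for c in span.claims):
            flagged.append(span.index)
        # Only retrieved or tool-produced text is treated as evidence.
        if span.kind in {"search", "tool"}:
            evidence.append(span.text)
    return flagged
```

Calling `audit` on a list of spans returns the indices of spans whose claims have no support in earlier search or tool output. The earliest flagged index is then a natural candidate for the first error in the trajectory.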

In practice

Use TELBench, the 1,000-instance benchmark, to evaluate error-span identification.
Implement claim-centric auditing with DRIFT.
Expect gains of up to 30 percentage points in span-level error localization and first-error accuracy.
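To make the two reported measures concrete, here is a hedged Python sketch of how they might be scored. The summary does not give the exact definitions, so both are assumptions: first-error accuracy is taken as the share of trajectories where the earliest predicted error span matches the earliest annotated one, and span-level localization is scored as set-based F1 over span indices.

```python
def first_error_accuracy(predicted: list[list[int]], gold: list[list[int]]) -> float:
    """Fraction of trajectories whose earliest predicted error span matches
    the earliest annotated error span (assumed definition)."""
    hits = 0
    for pred, ref in zip(predicted, gold):
        pred_first = min(pred) if pred else None  # None means "no error flagged"
        gold_first = min(ref) if ref else None
        hits += pred_first == gold_first
    return hits / len(gold)


def span_f1(pred: list[int], ref: list[int]) -> float:
    """Set-based F1 between predicted and annotated error span indices
    for a single trajectory (assumed definition)."""
    p, r = set(pred), set(ref)
    if not p and not r:
        return 1.0  # nothing to find and nothing flagged
    tp = len(p & r)
    if tp == 0:
        return 0.0
    precision = tp / len(p)
    recall = tp / len(r)
    return 2 * precision * recall / (precision + recall)
```

Each inner list holds the error span indices for one trajectory, so `predicted` and `gold` must be aligned by position. A gain "in percentage points" is then the plain difference between two such accuracies expressed on a 0 to 100 scale.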

Topics

Deep-Research Agents
Error Localization
Agent Evaluation
Auditing Frameworks
LLM-Assisted Review
TELBench
DRIFT

Best for: Research Scientist, AI Architect, AI Engineer, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.