How agent o11y differs from traditional o11y — Phil Hetzel, Braintrust

2026-05-28 · Source: AI Engineer · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Software Development & Engineering · Depth: Intermediate, long

Summary

Phil Hetzel from Braintrust explains how agent observability differs fundamentally from traditional observability, which focuses on uptime and technical performance metrics such as latency and error rates. Agent observability must instead contend with the non-deterministic behavior of LLM agents, which requires qualitative metrics such as grounding, tool usage, and brand alignment. Agent traces are also far more complex: they are semi-structured, voluminous (traces often exceed a gigabyte, with individual spans reaching 20MB), and fast-moving, which calls for database designs specialized for ingestion, indexing, and full-text search. Braintrust's custom database, built on a forked Tantivy index, is the example given. Effective agent observability also involves diverse personas, including non-technical subject matter experts such as clinicians or lawyers, who help improve agent performance through natural language prompts and human annotation workflows. Braintrust is additionally developing LLM-driven topic modeling and sentiment analysis on traces to shorten the iteration loop between identifying production problems and implementing fixes.

Key takeaway

If you are an AI engineer or AI product manager deploying generative AI agents, recognize that traditional observability tools are insufficient. Adopt a specialized agent observability platform that handles non-deterministic behavior, processes complex and voluminous trace data, and integrates feedback from non-technical domain experts. Prioritize solutions with robust indexing and full-text search so you can diagnose agent performance efficiently and shorten your iteration cycles.

Key insights

Agent observability requires specialized approaches due to LLM non-determinism, complex trace data, and diverse stakeholder involvement.

Principles

LLM agents are non-deterministic, demanding qualitative performance metrics.
Agent traces are voluminous and semi-structured, requiring custom data infrastructure.
Non-technical subject matter experts are crucial for effective agent evaluation.

Method

Braintrust developed a custom database with write-ahead logs, indexing, and a Tantivy-based full-text index to manage large, semi-structured agent traces for real-time and analytical queries.
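
As a rough illustration of the ingestion path described above, the following Python sketch appends each span to a write-ahead log before updating an in-memory inverted index used for full-text search. It is a toy model, not Braintrust's implementation: the class name TraceStore, the JSON-lines log format, the span_id field, and the simple regex tokenizer are all assumptions made for illustration, and a production system would rely on a real index such as Tantivy.

```python
import json
import os
import re
from collections import defaultdict


class TraceStore:
    """Toy span store (hypothetical): on-disk write-ahead log + in-memory inverted index."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.spans = {}                  # span_id -> span dict
        self.index = defaultdict(set)    # token -> set of span_ids
        if os.path.exists(wal_path):
            self._replay()               # rebuild state after a restart

    @staticmethod
    def _tokens(span):
        # Spans are semi-structured, so flatten the whole document to text.
        # Note that field names become searchable tokens as well as values.
        text = json.dumps(span, sort_keys=True).lower()
        return set(re.findall(r"[a-z0-9_]+", text))

    def _apply(self, span):
        self.spans[span["span_id"]] = span
        for token in self._tokens(span):
            self.index[token].add(span["span_id"])

    def _replay(self):
        with open(self.wal_path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    self._apply(json.loads(line))

    def ingest(self, span):
        # Durably log the span first, then make it queryable.
        with open(self.wal_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(span) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self._apply(span)

    def search(self, query):
        """Return sorted span_ids whose text contains every query term."""
        terms = re.findall(r"[a-z0-9_]+", query.lower())
        if not terms:
            return []
        hits = set.intersection(*(self.index.get(t, set()) for t in terms))
        return sorted(hits)
```

Writing to the log before touching the index means a crash between the two steps loses no data: reopening the store on the same file replays the log and rebuilds the index.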

In practice

Implement qualitative metrics like grounding and brand alignment for agent evaluation.
Utilize human annotation to identify agent failure modes and refine automated scoring.
Employ LLM-driven clustering on traces for topic modeling and sentiment analysis.
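
The first two practices above can be sketched in a few lines of Python. The example scores grounding as the fraction of answer sentences whose content words all appear in the retrieved context, then measures how often the thresholded automated score agrees with human annotations. The lexical heuristic is a deliberately crude stand-in for an LLM-based judge, and the function names, the stopword list, and the 0.5 threshold are hypothetical choices rather than anything described in the talk.

```python
import re

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "it", "for", "on", "within"}


def content_words(text):
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def grounding_score(answer, context):
    """Fraction of answer sentences whose content words all occur in the context."""
    ctx = content_words(context)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    grounded = sum(1 for s in sentences if content_words(s) <= ctx)
    return grounded / len(sentences)


def agreement_rate(auto_scores, human_labels, threshold=0.5):
    """Share of traces where the thresholded automated score matches the human label."""
    if len(auto_scores) != len(human_labels):
        raise ValueError("auto_scores and human_labels must have the same length")
    if not auto_scores:
        return 0.0
    matches = sum((s >= threshold) == bool(h) for s, h in zip(auto_scores, human_labels))
    return matches / len(auto_scores)
```

A low agreement rate is a signal that the automated scorer needs refinement, for example by reviewing the traces where human annotators and the scorer disagree.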

Topics

Agent Observability
Generative AI
LLM Agents
Trace Data Management
Database Design
Full-Text Indexing

Best for: AI Architect, AI Engineer, MLOps Engineer, AI Product Manager

Editorial summary, takeaway, and curation by AIssential. Original article published by AI Engineer.