How to Find the Agent Failures Your Evals Miss with Scott Clark - #767
Summary
Scott Clark, co-founder and CEO of Distributional, discusses operating and improving complex LLM systems and agents in production. He introduces a "Maslow's hierarchy of observability" comprising telemetry for logging, monitoring for known signals, and analytics for surfacing "unknown unknowns." Clark highlights real-world failures, such as "lazy" tool-use hallucinations that standard evaluations miss, and explains how mapping traces into vector fingerprints enables clustering and topic discovery to uncover emergent behaviors. Analytics can feed the data flywheel by generating new evaluations, guardrails, and training data, which is crucial for non-stationary models. The discussion also covers practical aspects like instrumentation with OpenTelemetry and the GenAI semantic conventions, emphasizing the role of dedicated analytics tools for iterative improvement.
Key takeaway
If you are an AI architect or machine learning engineer deploying LLM agents, recognize that traditional monitoring alone is insufficient for dynamic, non-stationary models. Integrate analytics to proactively identify emergent anti-patterns and distributional shifts, so that you can keep refining the agent and maintain trust in it. This approach helps you move beyond basic debugging to systematically improve agent quality and align performance with business objectives, even as underlying models evolve.
Key insights
Effective LLM operations require a hierarchy of observability, with analytics crucial for discovering unknown issues and driving iterative improvement.
Principles
- Models are non-stationary; online adaptive approaches are essential.
- Analytics identifies "unknown unknowns" beyond standard monitoring.
- Data flywheels need analytics to extract actionable signals from noise.
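One way to make the non-stationarity principle concrete is to compare how an agent behaves across two time windows. The sketch below is an illustration, not a method described in the episode: it assumes each window is simply a list of the tool names the agent called, and it uses Jensen-Shannon divergence as the drift score. The helper names, the sample windows, and the 0.1 threshold are all hypothetical.

```python
import math
from collections import Counter


def distribution(events):
    """Turn a list of categorical events into relative frequencies."""
    if not events:
        raise ValueError("window must contain at least one event")
    counts = Counter(events)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    keys = set(p) | set(q)
    mix = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mix(a):
        # KL(a || mix); terms with zero probability contribute nothing.
        return sum(a[k] * math.log2(a[k] / mix[k]) for k in a if a[k] > 0)

    return 0.5 * kl_to_mix(p) + 0.5 * kl_to_mix(q)


# Hypothetical data: which tool the agent used per request ("none" = no tool).
last_week = ["search"] * 6 + ["calculator"] * 3 + ["none"] * 1
this_week = ["search"] * 2 + ["calculator"] * 1 + ["none"] * 7

DRIFT_THRESHOLD = 0.1  # assumed value; tune on your own traffic

score = js_divergence(distribution(last_week), distribution(this_week))
drifted = score > DRIFT_THRESHOLD
print(f"JS divergence: {score:.3f}, drifted: {drifted}")
```

A jump in the share of "none" requests is the kind of shift that a fixed evaluation set would not flag, since the model can still pass every known test while quietly using its tools less often.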
Method
Map complex LLM traces into vector fingerprints, then apply unsupervised learning and clustering (for example, the approach in the Clio paper) to identify behavioral sub-patterns, and use an LLM to generate explanations and candidate fixes for each pattern.
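The method above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the pipeline discussed in the episode: the trace format (a list of step dictionaries with a `kind` field and an optional `claims_tool_result` flag) is hypothetical, the fingerprint is a hand-built feature vector standing in for a learned embedding, and a small deterministic k-means stands in for whatever clustering a real system would use.

```python
import math
from collections import Counter


def fingerprint(trace):
    """Map one trace to a fixed-length vector of behavioral features."""
    n = len(trace) or 1
    kinds = Counter(step["kind"] for step in trace)
    claimed = sum(
        1 for step in trace
        if step["kind"] == "llm" and step.get("claims_tool_result")
    )
    return [
        kinds["tool"] / n,  # share of steps that are real tool calls
        kinds["llm"] / n,   # share of steps that are model generations
        claimed / n,        # share of steps where the model claims a tool result
        math.log1p(n),      # trace length, log-scaled
    ]


def kmeans(points, k, iters=20):
    """Tiny k-means with deterministic farthest-point initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


llm, tool = {"kind": "llm"}, {"kind": "tool"}
lazy = {"kind": "llm", "claims_tool_result": True}  # says it used a tool, did not

traces = [
    [llm, tool, llm],        # healthy: calls the tool
    [lazy, llm],             # "lazy": claims a result without a tool call
    [llm, tool, tool, llm],  # healthy
    [lazy, llm, lazy],       # "lazy"
]

labels = kmeans([fingerprint(t) for t in traces], k=2)
print(labels)
```

In a fuller version, each resulting cluster would be summarized (for instance by an LLM reading a sample of its traces) so that a reviewer sees a description of the behavior rather than a cluster number.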
In practice
- Instrument LLM systems with OpenTelemetry.
- Adopt the GenAI semantic conventions for structured logging.
- Use analytics to generate new evals and guardrails.
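As a rough illustration of the first two items, the sketch below builds span attributes using names from the OpenTelemetry GenAI semantic conventions (`gen_ai.operation.name`, `gen_ai.request.model`, `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`, `gen_ai.tool.name`). Those conventions are still evolving, so treat the attribute names as an assumption to check against the current specification. The helper functions, the `RecordingSpan` stand-in, and the model and tool names are hypothetical and exist only so the example runs without the OpenTelemetry SDK installed.

```python
def genai_attributes(operation, model=None, input_tokens=None,
                     output_tokens=None, tool_name=None):
    """Build GenAI-style span attributes, omitting anything not provided."""
    candidates = {
        "gen_ai.operation.name": operation,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.tool.name": tool_name,
    }
    return {key: value for key, value in candidates.items() if value is not None}


def annotate(span, attributes):
    """Copy attributes onto any span-like object exposing set_attribute()."""
    for key, value in attributes.items():
        span.set_attribute(key, value)


class RecordingSpan:
    """Hypothetical stand-in for a real span, used only in this example."""

    def __init__(self, name):
        self.name = name
        self.attributes = {}

    def set_attribute(self, key, value):
        self.attributes[key] = value


# One model call and one tool call from the same agent turn.
chat_span = RecordingSpan("chat example-model")
annotate(chat_span, genai_attributes(
    "chat", model="example-model", input_tokens=812, output_tokens=64))

tool_span = RecordingSpan("execute_tool search_orders")
annotate(tool_span, genai_attributes("execute_tool", tool_name="search_orders"))

print(chat_span.attributes)
print(tool_span.attributes)
```

With the `opentelemetry-api` package installed, the same `annotate` helper should work on a real span, for example one obtained from `trace.get_tracer(__name__).start_as_current_span(...)`, because real spans also expose `set_attribute`. Recording tool calls as their own spans is what lets later analytics distinguish a trace where a tool actually ran from one where the model only claimed it did.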
Topics
- LLM Agent Observability
- Production System Analytics
- Non-Stationary Models
- Data Flywheel
- OpenTelemetry
Best for: AI Architect, Machine Learning Engineer, CTO, MLOps Engineer, AI Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence).