LLM Observability, Evaluation, Experimentation Platform — Dat Ngo, Arize
Summary
Dat Ngo from Arize AI presented the company's platform for managing and improving large language model (LLM) applications, focusing on observability, evaluation, and experimentation. The platform addresses the practical challenges of getting AI applications to work in production, drawing on experience with large enterprises and on processing between 100 billion and 1 trillion tokens last year. Observability is built on OpenTelemetry and provides detailed traces, spans, sessions, distributional views, and trajectory evaluations, which are used to audit agent behavior and identify issues such as high latency or incorrect component ordering. Evaluation methods include LLM-as-a-judge, human feedback, golden datasets, and deterministic checks, and are designed for both technical users and subject matter experts. The platform supports running experiments to test changes in prompts, models, or orchestration, with a strong emphasis on automating the entire improvement flywheel. Arize offers two products: the open-source Arize Phoenix for local deployment and the enterprise-grade Arize AX, used by companies like Uber and Booking.
Key takeaway
If you are an AI architect or MLOps engineer building production LLM applications, implement robust observability and evaluation frameworks from the outset. Relying solely on manual checks is unsustainable because LLM output is non-deterministic. Integrate OpenTelemetry for deep visibility into agent behavior, and automate evaluation, including LLM-as-a-judge and deterministic checks, so that regressions are caught efficiently and improvement is continuous. This structured approach is critical for scaling and maintaining reliable AI systems.
Key insights
LLM application development demands structured observability, evaluation, and automated experimentation to manage non-deterministic behavior and ensure continuous improvement.
Principles
- AI development mirrors traditional software engineering.
- OpenTelemetry is foundational for LLM observability.
- Evaluation requires diverse signal types and scopes.
Method
Observe LLM agent behavior using OpenTelemetry traces and sessions. Evaluate performance with LLM-as-a-judge, human feedback, golden datasets, or deterministic checks across various scopes. Experiment with changes to prompts, models, or orchestration, and automate the improvement flywheel.
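The observation step can be illustrated with a small, standard-library-only sketch. It is not the OpenTelemetry SDK and not Arize's implementation: `Tracer`, `Span`, `run_agent`, and `steps_in_order` are hypothetical names chosen for this example, and the agent steps are placeholders. The sketch only shows the shape of the data involved: nested spans that share a trace ID, record a parent and a duration, and can be checked afterwards for latency and for the order in which components ran.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Span:
    """One timed unit of work inside a trace (hypothetical, OTel-like shape)."""
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    start: float
    end: float = 0.0
    attributes: dict = field(default_factory=dict)

    @property
    def duration_ms(self) -> float:
        return (self.end - self.start) * 1000.0


class Tracer:
    """In-memory stand-in for a real tracing SDK; assumption for illustration."""

    def __init__(self) -> None:
        self.trace_id = uuid.uuid4().hex
        self.spans: List[Span] = []      # recorded in start order
        self._stack: List[Span] = []     # currently open spans

    @contextmanager
    def span(self, name: str, **attributes) -> Iterator[Span]:
        parent_id = self._stack[-1].span_id if self._stack else None
        current = Span(name, self.trace_id, uuid.uuid4().hex[:16], parent_id,
                       time.perf_counter(), attributes=dict(attributes))
        self.spans.append(current)
        self._stack.append(current)
        try:
            yield current
        finally:
            current.end = time.perf_counter()
            self._stack.pop()


def run_agent(tracer: Tracer, question: str) -> str:
    """Placeholder agent: a retrieval step followed by a generation step."""
    with tracer.span("agent", question=question):
        with tracer.span("retrieve"):
            docs = ["doc-1", "doc-2"]
        with tracer.span("generate", model="placeholder-model") as gen:
            answer = f"Answer to {question!r} using {len(docs)} documents"
            gen.attributes["output_chars"] = len(answer)
    return answer


def steps_in_order(spans: List[Span], expected: List[str]) -> bool:
    """True if the expected step names occur in the trace in that order."""
    names = iter(s.name for s in spans)
    return all(step in names for step in expected)


if __name__ == "__main__":
    tracer = Tracer()
    run_agent(tracer, "What is observability?")
    for s in tracer.spans:
        print(f"{s.name:<10} parent={s.parent_id} {s.duration_ms:.3f} ms")
    print("retrieve before generate:",
          steps_in_order(tracer.spans, ["retrieve", "generate"]))
```

In a real deployment the spans would be emitted through an OpenTelemetry SDK and exported to a backend rather than held in a list; the ordering check is a simplified version of what a trajectory evaluation might look at.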
In practice
- Implement OpenTelemetry for agent audit records.
- Combine LLM-as-a-judge with golden datasets.
- Automate eval creation and experiment execution.
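As a rough illustration of the practices listed above, the following sketch runs two variants of an application against a small golden dataset, scores each output with a deterministic check and a judge function, and picks the better variant. Everything here is an assumption made for the example: the dataset, the `baseline` and `candidate` variants, and `stub_judge`, which is a token-overlap placeholder standing in for a call to a real LLM judge. It is not Arize's API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Example:
    question: str
    expected: str


# Hypothetical golden dataset: inputs paired with known-good answers.
GOLDEN: List[Example] = [
    Example("capital of France", "Paris"),
    Example("2 + 2", "4"),
    Example("opposite of hot", "cold"),
]


def deterministic_check(output: str, expected: str) -> bool:
    """Rule-based check: the expected answer must appear in the output."""
    return expected.lower() in output.lower()


def stub_judge(output: str, expected: str) -> float:
    """Placeholder for an LLM judge; returns a score in [0, 1].

    A real judge would prompt a model with a rubric. Token overlap is used
    here only so the example runs without any external service.
    """
    out_tokens = set(output.lower().split())
    exp_tokens = set(expected.lower().split())
    return len(out_tokens & exp_tokens) / len(exp_tokens) if exp_tokens else 0.0


# Two hypothetical variants, standing in for a prompt or model change.
def baseline(question: str) -> str:
    answers = {"capital of France": "Paris", "2 + 2": "4", "opposite of hot": "warm"}
    return answers.get(question, "")


def candidate(question: str) -> str:
    answers = {"capital of France": "Paris", "2 + 2": "4", "opposite of hot": "cold"}
    return "The answer is " + answers.get(question, "unknown")


def run_experiment(variant: Callable[[str], str],
                   dataset: List[Example],
                   judge: Callable[[str, str], float]) -> Dict[str, float]:
    """Run one variant over the dataset and aggregate both eval signals."""
    passed = 0
    judge_total = 0.0
    for ex in dataset:
        output = variant(ex.question)
        passed += deterministic_check(output, ex.expected)
        judge_total += judge(output, ex.expected)
    n = len(dataset)
    return {"pass_rate": passed / n, "mean_judge_score": judge_total / n}


def compare(variants: Dict[str, Callable[[str], str]]) -> Dict[str, Dict[str, float]]:
    return {name: run_experiment(fn, GOLDEN, stub_judge) for name, fn in variants.items()}


if __name__ == "__main__":
    results = compare({"baseline": baseline, "candidate": candidate})
    for name, metrics in results.items():
        print(name, metrics)
    best = max(results, key=lambda name: results[name]["pass_rate"])
    print("best variant by pass rate:", best)
```

Running the same comparison automatically on every prompt or model change is one simple way to approximate the automated experiment loop; human feedback would enter as an additional scoring function alongside the two shown here.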
Topics
- LLM Observability
- AI Agent Evaluation
- OpenTelemetry
- LLM Experimentation
- MLOps
- Arize AI
Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Architect, AI Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by AI Engineer.