LLM Observability, Evaluation, Experimentation Platform — Dat Ngo, Arize

· Source: AI Engineer · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Software Development & Engineering · Depth: Advanced, long

Summary

Dat Ngo of Arize AI presented the company's platform for managing and improving large language model (LLM) applications, organized around observability, evaluation, and experimentation. The platform targets the practical challenges of getting AI systems to work in production, drawing on Arize's experience with large enterprises and on processing vast token volumes (100 billion to 1 trillion tokens last year). Observability is built on OpenTelemetry and provides traces, spans, sessions, distributional views, and trajectory evaluations for auditing agent behavior and identifying issues such as high latency or incorrect component ordering. Evaluation methods include LLM-as-a-judge, human feedback, golden datasets, and deterministic checks, and they are designed for both technical users and subject matter experts. The platform supports experiments that test changes to prompts, models, or orchestration, with a strong emphasis on automating the entire improvement flywheel. Arize offers two products: the open-source Arize Phoenix for local deployment and the enterprise-grade Arize AX, used by companies like Uber and Booking.
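To make the idea of a trajectory evaluation concrete, here is a minimal sketch that audits one trace for incorrect component ordering and slow spans. It is not Arize's implementation: the `Span` dataclass is a simplified stand-in for an OpenTelemetry span, and the component names, the `audit_trajectory` helper, and the latency budget are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Simplified stand-in for an OpenTelemetry span (hypothetical shape)."""
    name: str        # component name, e.g. "retriever"
    start_ms: float  # start time relative to the trace start
    end_ms: float    # end time relative to the trace start

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def audit_trajectory(spans, expected_order, latency_budget_ms):
    """Return human-readable findings for a single trace."""
    findings = []
    ordered = sorted(spans, key=lambda s: s.start_ms)
    observed = [s.name for s in ordered]
    # Check that components ran in the order the design expects.
    if observed != expected_order:
        findings.append(
            f"order mismatch: expected {expected_order}, got {observed}"
        )
    # Flag any span that exceeds the per-span latency budget.
    for span in ordered:
        if span.duration_ms > latency_budget_ms:
            findings.append(
                f"slow span: {span.name} took {span.duration_ms:.0f} ms"
            )
    return findings


# Example trace in which the LLM call ran before retrieval and was slow.
trace = [
    Span("llm_call", 0, 900),
    Span("retriever", 900, 1000),
    Span("formatter", 1000, 1010),
]
findings = audit_trajectory(
    trace,
    expected_order=["retriever", "llm_call", "formatter"],
    latency_budget_ms=500,
)
for finding in findings:
    print(finding)
```

In a real deployment the spans would come from exported OpenTelemetry data rather than hand-built objects, and the expected order would depend on the agent's design.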

Key takeaway

If you are an AI architect or MLOps engineer building production LLM applications, implement robust observability and evaluation frameworks from the outset. Relying solely on manual checks is unsustainable because LLM outputs are non-deterministic. Integrate OpenTelemetry for deep visibility into agent behavior, and automate evaluation, including LLM-as-a-judge and deterministic checks, so that regressions are caught efficiently and improvement is continuous. This structured approach is critical for scaling and maintaining reliable AI systems.
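As an illustration of automated deterministic checks used to catch regressions, the sketch below computes a pass rate per check for a baseline and a candidate set of outputs, then flags any check whose pass rate dropped. The check names, the required keys (`answer`, `sources`), the sample outputs, and the tolerance are assumptions for the example, not details from the talk.

```python
import json


def is_valid_json(output: str) -> bool:
    """Deterministic check: the output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def has_required_keys(output: str, keys=("answer", "sources")) -> bool:
    """Deterministic check: the output is a JSON object with the given keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(k in data for k in keys)


CHECKS = {"valid_json": is_valid_json, "required_keys": has_required_keys}


def pass_rates(outputs):
    """Fraction of outputs that pass each check."""
    return {
        name: sum(check(o) for o in outputs) / len(outputs)
        for name, check in CHECKS.items()
    }


def regressions(baseline, candidate, tolerance=0.05):
    """Names of checks whose pass rate fell by more than the tolerance."""
    return [
        name for name in baseline
        if candidate[name] < baseline[name] - tolerance
    ]


# Hypothetical outputs from the current version and from a proposed change.
baseline_outputs = [
    '{"answer": "42", "sources": ["doc1"]}',
    '{"answer": "7", "sources": []}',
]
candidate_outputs = [
    '{"answer": "42", "sources": ["doc1"]}',
    '{"answer": "7"}',  # the "sources" key is missing
]

baseline_rates = pass_rates(baseline_outputs)
candidate_rates = pass_rates(candidate_outputs)
regressed = regressions(baseline_rates, candidate_rates)
print(regressed)
```

Checks like these are cheap enough to run on every change, which leaves LLM-as-a-judge and human review for the qualities that cannot be verified mechanically.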

Key insights

LLM application development demands structured observability, evaluation, and automated experimentation to manage non-deterministic behavior and ensure continuous improvement.

Principles

Method

Observe LLM agent behavior using OpenTelemetry traces and sessions. Evaluate performance with LLM-as-a-judge, human feedback, or deterministic checks across various scopes. Experiment with changes to prompts, models, or orchestration, and automate the resulting improvement flywheel.
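The experiment step can be sketched as a loop that runs each prompt variant over a golden dataset, scores the outputs, and compares mean scores. Everything here is a hypothetical stand-in: `fake_model` replaces a real model call, `judge` replaces an LLM-as-a-judge call with a simple string comparison, and the dataset and prompts are invented for the example.

```python
from statistics import mean

# Hypothetical golden dataset of questions with known-good answers.
GOLDEN = [
    {"question": "capital of France?", "expected": "Paris"},
    {"question": "2 + 2?", "expected": "4"},
]


def fake_model(prompt: str, question: str) -> str:
    """Stand-in for a real model call (assumption for this sketch)."""
    answers = {"capital of France?": "Paris", "2 + 2?": "4"}
    answer = answers[question]
    # Pretend a prompt asking for concision yields a bare answer,
    # while any other prompt yields a padded, hedged answer.
    if "concise" in prompt:
        return answer
    return f"I think it might be {answer}, maybe."


def judge(output: str, expected: str) -> float:
    """Stand-in for LLM-as-a-judge: 1.0 exact, 0.5 contained, 0.0 otherwise."""
    if output.strip() == expected:
        return 1.0
    return 0.5 if expected in output else 0.0


def run_experiment(variants, model, judge_fn, dataset):
    """Mean judge score per prompt variant over the dataset."""
    return {
        name: mean(
            judge_fn(model(prompt, ex["question"]), ex["expected"])
            for ex in dataset
        )
        for name, prompt in variants.items()
    }


variants = {
    "baseline": "Answer the question.",
    "concise": "Answer in a concise way.",
}
scores = run_experiment(variants, fake_model, judge, GOLDEN)
best = max(scores, key=scores.get)
print(scores, best)
```

Automating the flywheel would mean triggering a run like this on each change and promoting a variant only when it scores at least as well as the current baseline.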

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Architect, AI Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AI Engineer.