coSTAR: How We Ship AI Agents at Databricks Fast, Without Breaking Things

2026-03-20 · Source: Databricks · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, medium

Summary

Databricks has developed coSTAR (coupled Scenario, Trace, Assess, Refine), a methodology and framework for testing and deploying AI agents. It targets four problems that make agents hard to test: non-determinism, slow feedback loops, cascading errors, and subjective quality. coSTAR runs two mirrored STAR loops, one that refines the agent and one that aligns LLM judges with human expert judgment. Scenario definitions serve as test fixtures, MLflow traces capture agent execution, and agentic judges assess properties of the output rather than checking for exact matches. Both the agent and the judges are refined iteratively, so the test suite evolves and stays aligned with human expertise, much as a traditional software test suite matures over time.

Key takeaway

For MLOps engineers deploying AI agents, a structured testing framework like coSTAR is critical. Your team should implement scenario-based testing and use agentic judges to manage non-determinism and subjective quality. Agents are then refined against evaluations you can trust, and your LLM judges stay aligned with human expertise, so a flawed agent is not shipped on the strength of misleading scores.

Key insights

coSTAR couples two loops: one refines the agent against judge assessments, and the other keeps those judges aligned with human expert judgment, so the scores the agent is refined against remain reliable.

Principles

Decouple execution from scoring.
Every production bug becomes a new scenario.
Test suites evolve, starting simple and growing.
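
The first two principles can be illustrated with a minimal sketch. It is not the Databricks implementation: `Scenario`, `Trace`, `toy_agent`, `execute`, and `score` are hypothetical names, and a plain dataclass stands in for an MLflow trace. The point it tries to show is that the agent runs once per scenario, after which any number of scoring passes can reuse the stored traces.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Scenario:
    name: str
    request: str


@dataclass(frozen=True)
class Trace:
    # Hypothetical stand-in for a recorded agent execution.
    scenario: str
    steps: tuple[str, ...]
    output: str


call_log: list[str] = []  # records every (expensive) agent execution


def toy_agent(request: str) -> tuple[tuple[str, ...], str]:
    """Placeholder agent: a real one would call an LLM and tools."""
    call_log.append(request)
    return ("plan", "run_sql"), f"Result for: {request}"


def execute(scenarios: list[Scenario]) -> list[Trace]:
    """Execution phase: run each scenario once and keep the trace."""
    traces = []
    for s in scenarios:
        steps, output = toy_agent(s.request)
        traces.append(Trace(s.name, steps, output))
    return traces


Judge = Callable[[Trace], bool]


def used_sql(trace: Trace) -> bool:
    return "run_sql" in trace.steps


def non_empty_output(trace: Trace) -> bool:
    return bool(trace.output.strip())


def score(traces: list[Trace], judges: dict[str, Judge]) -> dict[str, dict[str, bool]]:
    """Scoring phase: reads stored traces only, never calls the agent."""
    return {t.scenario: {name: j(t) for name, j in judges.items()} for t in traces}


scenarios = [Scenario("top_customers", "List the top 5 customers by revenue")]
# A production bug is added as a permanent scenario (illustrative example).
scenarios.append(Scenario("bug_empty_table", "Summarize a table with zero rows"))

traces = execute(scenarios)
first = score(traces, {"used_sql": used_sql})
# A judge added later is applied to the same traces without re-running the agent.
second = score(traces, {"used_sql": used_sql, "non_empty_output": non_empty_output})
```

Because scoring reads only stored traces, changing or adding a judge costs no additional agent executions, which is one way to shorten the feedback loop.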

Method

coSTAR defines scenarios, captures traces via MLflow, assesses them with agentic judges, and refines the agent. A second loop aligns the judges with Golden Sets curated by human experts.
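
The second loop can be sketched as an agreement check between a judge and a Golden Set. This is an assumption-laden illustration, not the coSTAR API: `GoldenExample`, `agreement`, `judge_v1`, `judge_v2`, and the `ALIGNMENT_THRESHOLD` value are all hypothetical, and simple keyword functions stand in for LLM judges.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenExample:
    trace_text: str       # what the agent produced
    human_verdict: bool   # label assigned by a human expert


JudgeFn = Callable[[str], bool]


def agreement(judge: JudgeFn, golden_set: list[GoldenExample]) -> tuple[float, list[GoldenExample]]:
    """Return the agreement rate and the examples where judge and human differ."""
    disagreements = [ex for ex in golden_set if judge(ex.trace_text) != ex.human_verdict]
    rate = 1 - len(disagreements) / len(golden_set)
    return rate, disagreements


# Illustrative Golden Set; the contents are made up for this sketch.
golden_set = [
    GoldenExample("SELECT * FROM sales LIMIT 5 -- cited source table", True),
    GoldenExample("I think revenue is probably high", False),
    GoldenExample("Revenue was 4.2M per sales table; SELECT sum(amount) FROM sales", True),
    GoldenExample("SELECT 1 -- unrelated query", False),
]


def judge_v1(text: str) -> bool:
    # First attempt: passes anything that contains a query.
    return "SELECT" in text


def judge_v2(text: str) -> bool:
    # Refined after inspecting disagreements: the query must touch the right table.
    return "SELECT" in text and "sales" in text


ALIGNMENT_THRESHOLD = 0.9  # hypothetical bar for trusting a judge

rate_v1, misses_v1 = agreement(judge_v1, golden_set)
rate_v2, misses_v2 = agreement(judge_v2, golden_set)
trusted_v1 = rate_v1 >= ALIGNMENT_THRESHOLD
trusted_v2 = rate_v2 >= ALIGNMENT_THRESHOLD
```

In this sketch the disagreement list plays the role of the Refine step for the judge: each miss shows where the judge's criteria differ from the expert's, and the judge is only used to score the agent once its agreement rate clears the chosen bar.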

In practice

Use scenario definitions for agent test fixtures.
Implement agentic judges for nuanced evaluations.
Curate Golden Sets to align LLM judges.
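
One way to picture a scenario fixture that assesses properties rather than exact matches is shown below. Everything here is hypothetical: `Scenario`, `flaky_agent`, and `assess` are illustrative names, the property checks are plain predicates standing in for agentic judges, and the random opener simulates non-deterministic phrasing.

```python
import random
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Scenario:
    name: str
    request: str
    # Named properties the output must satisfy, instead of one expected string.
    properties: dict[str, Callable[[str], bool]] = field(default_factory=dict)


def flaky_agent(request: str, rng: random.Random) -> str:
    """Placeholder agent: wording varies between runs, the facts do not."""
    opener = rng.choice(["Sure!", "Here you go:", "Done."])
    return f"{opener} The table `orders` has 3 columns: id, amount, ts."


scenario = Scenario(
    name="describe_orders",
    request="Describe the orders table",
    properties={
        "mentions_table": lambda out: "orders" in out,
        "lists_all_columns": lambda out: all(c in out for c in ("id", "amount", "ts")),
    },
)


def assess(scenario: Scenario, output: str) -> dict[str, bool]:
    """Evaluate every declared property against one output."""
    return {name: check(output) for name, check in scenario.properties.items()}


rng = random.Random(0)
outputs = [flaky_agent(scenario.request, rng) for _ in range(5)]
results = [assess(scenario, o) for o in outputs]
all_pass = all(all(r.values()) for r in results)
```

An exact-match assertion against a single expected string would fail whenever the opener changes, while the property checks pass on every run as long as the facts are present.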

Topics

AI Agent Testing
MLflow
LLM Judges
coSTAR Framework
Agent Refinement

Best for: Machine Learning Engineer, MLOps Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Databricks.