coSTAR: How We Ship AI Agents at Databricks Fast, Without Breaking Things
Summary
Databricks has developed coSTAR (coupled Scenario, Trace, Assess, Refine), a methodology and framework for testing and deploying AI agents with confidence. It targets four problems that make agents hard to test: non-determinism, slow feedback loops, cascading errors, and subjective quality. coSTAR runs two mirrored STAR loops: one refines the agent, and the other aligns LLM judges with human expert judgment. Scenario definitions serve as test fixtures, MLflow traces capture agent execution, and agentic judges assess properties of the output rather than checking for exact matches. Because both the agent and the judges are refined iteratively, the test suite evolves and stays aligned with human expertise, much as a traditional software test suite matures over time.
Key takeaway
For MLOps engineers deploying AI agents, adopting a structured testing framework like coSTAR is critical. Your team should implement scenario-based testing and use agentic judges to manage non-determinism and subjective quality. This ensures that agents are refined against reliable evaluations and that your LLM judges remain aligned with human expertise, so that a flawed agent is not shipped on the strength of a misleading evaluation.
Key insights
coSTAR enables robust AI agent development through coupled loops for agent refinement and judge alignment.
Principles
- Decouple execution from scoring.
- Every production bug becomes a new scenario.
- Test suites evolve, starting simple and growing.
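The first two principles can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `Scenario`, `Trace`, `toy_agent`, `execute`, and `score` are hypothetical names, the `Trace` class is a plain-Python stand-in for an MLflow trace, and the production bug is invented. The point is only that the agent runs once, and judges then read the stored traces.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Scenario:
    """A test fixture: a named input the agent is run against (hypothetical shape)."""
    name: str
    prompt: str


@dataclass
class Trace:
    """A recorded agent run; a plain stand-in for an MLflow trace."""
    scenario: str
    steps: list[str]
    output: str


def toy_agent(prompt: str) -> tuple[list[str], str]:
    """Hypothetical stand-in for a real agent: returns (steps, final output)."""
    steps = [f"plan: {prompt}", "call_tool: sql"]
    return steps, f"SELECT * FROM users -- for: {prompt}"


def execute(scenarios: list[Scenario], agent: Callable) -> list[Trace]:
    """Execution phase: run the agent once per scenario and record a trace."""
    traces = []
    for s in scenarios:
        steps, output = agent(s.prompt)
        traces.append(Trace(s.name, steps, output))
    return traces


def score(traces: list[Trace], judges: dict[str, Callable[[Trace], bool]]):
    """Scoring phase: judges read stored traces; the agent is never re-run."""
    return {t.scenario: {name: judge(t) for name, judge in judges.items()}
            for t in traces}


scenarios = [Scenario("basic_query", "list all users")]
# Every production bug becomes a new scenario (this bug is invented for illustration).
scenarios.append(Scenario("bug_empty_table", "list users when the table is empty"))

traces = execute(scenarios, toy_agent)

judges = {
    "used_sql_tool": lambda t: any(s.startswith("call_tool: sql") for s in t.steps),
    "non_empty_output": lambda t: bool(t.output.strip()),
}
results = score(traces, judges)
```

Because scoring only touches recorded traces, a judge can be added or changed and the whole suite re-scored without paying for another agent run.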
Method
coSTAR defines scenarios, captures the agent's execution on each scenario as an MLflow trace, assesses those traces with agentic judges, and refines the agent based on the results. A second loop aligns the judges themselves against Golden Sets curated by human experts.
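To show why the assess step checks properties instead of exact matches, here is a small sketch. It is not the Databricks implementation: `exact_match` and `judge_properties` are hypothetical names, the JSON output shape is assumed, and the deterministic checks stand in for what an agentic (LLM-based) judge would evaluate.

```python
import json


def exact_match(output: str, expected: str) -> bool:
    """Brittle check: fails on any change in wording or key order."""
    return output == expected


def judge_properties(output: str) -> dict[str, bool]:
    """Check properties that any acceptable answer should satisfy.

    Deterministic stand-in for an agentic judge (assumed output shape:
    a JSON object with "answer" and "sources" keys).
    """
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return {"valid_json": False, "has_answer": False, "cites_source": False}
    is_obj = isinstance(data, dict)
    return {
        "valid_json": True,
        "has_answer": is_obj and bool(data.get("answer")),
        "cites_source": is_obj and len(data.get("sources", [])) > 0,
    }


# Two runs of a non-deterministic agent: same meaning, different text.
run_a = '{"answer": "42 rows", "sources": ["sales.orders"]}'
run_b = '{"sources": ["sales.orders"], "answer": "There are 42 rows"}'

same_text = exact_match(run_b, run_a)
verdict_a = judge_properties(run_a)
verdict_b = judge_properties(run_b)
```

Both runs satisfy every property even though their text differs, so a property-based assessment stays stable across non-deterministic runs where an exact-match assertion would flake.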
In practice
- Use scenario definitions for agent test fixtures.
- Implement agentic judges for nuanced evaluations.
- Curate Golden Sets to align LLM judges.
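Aligning a judge against a Golden Set can be sketched as measuring how often the judge agrees with expert labels and surfacing the disagreements for review. The names `GoldenExample`, `toy_judge`, and `alignment_report` are hypothetical, the examples are invented, and the keyword-based judge is a deliberately naive stand-in for an LLM judge.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenExample:
    """One expert-labeled item in a Golden Set (hypothetical shape)."""
    trace_id: str
    output: str
    human_label: bool  # expert verdict: True means acceptable


def toy_judge(output: str) -> bool:
    """Naive stand-in for an LLM judge: passes any output that mentions a table."""
    return "table" in output.lower()


def alignment_report(golden: list[GoldenExample],
                     judge: Callable[[str], bool]) -> tuple[float, list[str]]:
    """Return (agreement rate, ids where judge and expert disagree)."""
    disagreements = [ex.trace_id for ex in golden
                     if judge(ex.output) != ex.human_label]
    agreement = 1 - len(disagreements) / len(golden)
    return agreement, disagreements


golden_set = [
    GoldenExample("t1", "Query the orders table with a date filter.", True),
    GoldenExample("t2", "I cannot help with that.", False),
    GoldenExample("t3", "Drop the table to fix the error.", False),     # judge wrongly passes
    GoldenExample("t4", "Use sales.orders filtered by region.", True),  # judge wrongly fails
]

agreement, disagreements = alignment_report(golden_set, toy_judge)
print(f"agreement={agreement:.2f}, review={disagreements}")
```

The disagreeing ids are the input to the judge-refinement loop: each one is either a judge flaw to fix or a label to revisit with the expert.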
Topics
- AI Agent Testing
- MLflow
- LLM Judges
- coSTAR Framework
- Agent Refinement
Best for: Machine Learning Engineer, MLOps Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Databricks.