Stop Shipping on Vibes — How to Build Real Evals for Coding Agents
Summary
Jess from Braintrust presented on why evaluations (evals) matter for AI tooling, arguing that shipping AI features on "vibes" is risky. Evals produce quantifiable results, such as passing 94% of 200 test cases, and help answer critical questions: which LLM is the best choice, how the feature performs across diverse scenarios (e.g., Python vs. TypeScript, English vs. Japanese), whether it is cost-efficient, whether it adheres to the company voice, whether a change is an improvement, and whether anything has regressed. The presentation detailed the four major components of an eval: creating a dataset of test cases (golden cases, edge cases, failure modes), writing a task (prompt and model selection), developing a scoring system (deterministic checks, LLM-as-a-judge, human review), and running experiments. A real-life eval compared agentic search and vector search for fixing code bugs, using two datasets: Microsoft's TypeScript Go repository and SWE-bench Verified (Django-related bugs). Agentic search generally outperformed vector search in accuracy and cost-efficiency; vector search often returned code chunks with insufficient context, which led to repeated and more expensive searches.
Key takeaway
For AI Engineers and MLOps teams building or deploying LLM-powered applications, a structured evaluation framework is essential. Relying on ad-hoc testing risks shipping unreliable features and missing critical regressions, as OpenAI's model revert demonstrated. Integrate continuous evals into your CI/CD pipeline and use production data to refine models and keep performance consistent. This matters most when choosing between approaches such as agentic and vector search, where the wrong choice leads to costly and ineffective deployments.
Key insights
Robust AI evaluations are crucial for basing shipping decisions on quantified evidence, monitoring performance, and improving iteratively.
Principles
- Quantify AI performance; avoid "shipping on vibes."
- Evals require diverse datasets and clear scoring criteria.
- Trace visibility is critical for debugging AI agent behavior.
Method
An eval process involves creating datasets, defining tasks, establishing scoring systems (deterministic, LLM-as-judge, human-in-the-loop), and running experiments to compare configurations and track regressions or improvements.
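The four components described above can be sketched in a few lines of plain Python. This is a minimal illustration, not Braintrust's API or the presenter's code: `Case`, `exact_match`, `run_experiment`, `regressions`, the sample dataset, and `stub_task` are all hypothetical names, and the stub stands in for a real prompt and model call. Only the deterministic scorer is shown; an LLM-as-judge or human review step would plug into the same `scorer` slot.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    """One test case: an input plus the expected output."""
    name: str
    input: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Deterministic scorer: 1.0 on exact match after trimming, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def run_experiment(
    cases: list[Case],
    task: Callable[[str], str],
    scorer: Callable[[str, str], float],
) -> dict:
    """Run the task on every case and aggregate the scores."""
    scores = {c.name: scorer(task(c.input), c.expected) for c in cases}
    return {"scores": scores, "pass_rate": sum(scores.values()) / len(scores)}


def regressions(before: dict, after: dict) -> list[str]:
    """Names of cases whose score dropped between two experiment runs."""
    return [
        name
        for name, score in after["scores"].items()
        if score < before["scores"].get(name, 0.0)
    ]


# Hypothetical dataset: a golden case, an edge case, and a known failure mode.
DATASET = [
    Case("golden", "add(2, 3)", "5"),
    Case("edge_empty", "", ""),
    Case("failure_mode", "add(2, '3')", "TypeError"),
]


def stub_task(prompt: str) -> str:
    """Stand-in for the real prompt + model call; replace with your agent."""
    canned = {"add(2, 3)": "5", "": ""}
    return canned.get(prompt, "unknown")


baseline = run_experiment(DATASET, stub_task, exact_match)
print(f"pass rate: {baseline['pass_rate']:.0%}")
```

Running a second configuration through `run_experiment` and passing both results to `regressions` gives the per-case comparison that tracking regressions or improvements relies on.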
In practice
- Sample 10-20% of production logs to build eval datasets.
- Use `disallow_tools` and explicit prompts to control agent behavior.
- Pass parent span IDs as environment variables for subprocess trace visibility.
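Two of the practices above, sampling production logs and handing a parent span ID to a subprocess, can be sketched with the standard library alone. Everything named here is an assumption for illustration: the `EVAL_PARENT_SPAN_ID` variable name, the log shape, and the helper functions are hypothetical, and a real tracing library will define its own variable name and span format.

```python
import os
import random
import subprocess
import sys

# Hypothetical variable name; use whatever your tracing library expects.
PARENT_SPAN_ENV = "EVAL_PARENT_SPAN_ID"


def sample_logs(logs: list[dict], rate: float = 0.1, seed: int = 0) -> list[dict]:
    """Keep roughly `rate` of production logs as candidate eval cases."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    return [log for log in logs if rng.random() < rate]


def child_env(parent_span_id: str) -> dict[str, str]:
    """Copy the current environment and add the parent span ID to it."""
    return {**os.environ, PARENT_SPAN_ENV: parent_span_id}


def run_child_with_span(parent_span_id: str) -> str:
    """Launch a subprocess that reads the parent span ID from its environment.

    The child here only echoes the ID back; a real agent subprocess would
    use it to attach its own spans to the parent trace.
    """
    child_code = f"import os; print(os.environ['{PARENT_SPAN_ENV}'])"
    result = subprocess.run(
        [sys.executable, "-c", child_code],
        env=child_env(parent_span_id),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Sampling each log independently keeps the function stateless; if an exact sample size matters, `random.Random(seed).sample(logs, k)` is an alternative.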
Topics
- AI Evals
- Coding Agents
- Agentic Search
- Vector Search
- LLM Evaluation
Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by MLOps.community.