Stop Shipping on Vibes — How to Build Real Evals for Coding Agents

2026-03-31 · Source: MLOps.community · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Data Science & Analytics · Depth: Intermediate, extended

Summary

Jess from Braintrust presented on why AI tooling needs evaluations (evals), arguing that shipping AI features on "vibes" is a liability. Evals replace impressions with quantifiable metrics, such as passing 94% of 200 test cases, and help answer practical questions: which LLM to choose, how performance holds up across scenarios (e.g., Python vs. TypeScript, English vs. Japanese), whether the system is cost-efficient, whether output matches the company voice, whether a change is an improvement, and whether anything has regressed. The presentation covered the four major components of an eval: creating a dataset of test cases (golden cases, edge cases, failure modes), writing a task (prompt and model selection), developing a scoring system (deterministic checks, LLM-as-a-judge, human review), and running experiments. A real-life eval compared agentic search and vector search for fixing code bugs, using two datasets: Microsoft's TypeScript Go repo and SWE-bench Verified (Django-related bugs). Agentic search generally outperformed vector search on both accuracy and cost-efficiency; vector search often returned code chunks with insufficient context, which led to repeated, more expensive searches.

Key takeaway

For AI Engineers and MLOps teams building or deploying LLM-powered applications, a structured evaluation framework is essential. Relying on ad-hoc testing risks shipping unreliable features and missing critical regressions, a risk illustrated by OpenAI's model revert. Integrate continuous evals into your CI/CD pipeline and use production data to refine models and keep performance consistent. This matters most when comparing approaches such as agentic vs. vector search, where the wrong choice leads to costly and ineffective deployments.
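One way a continuous eval could gate a CI/CD pipeline is to compare the current run's scores against a stored baseline and fail the build on a drop. The sketch below is an assumption, not the presenter's setup: the `regressions` helper, the metric names, the scores, and the 0.02 tolerance are all made up for illustration.

```python
def regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,
) -> list[str]:
    """Return the metrics whose score dropped by more than `tolerance`.

    A metric missing from `current` is treated as a score of 0.0,
    so it is reported as a regression.
    """
    return sorted(
        name
        for name, base in baseline.items()
        if current.get(name, 0.0) < base - tolerance
    )


# Illustrative numbers only; a real pipeline would load these from eval runs.
baseline = {"accuracy": 0.94, "voice": 0.90}
current = {"accuracy": 0.95, "voice": 0.85}

failed = regressions(baseline, current)
# In a CI job, a non-empty list would typically end with a non-zero exit code,
# for example: sys.exit(1 if failed else 0)
```

The tolerance absorbs small run-to-run noise in LLM output, so only meaningful drops block a release.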

Key insights

Robust AI evaluations are crucial for quantifiable shipping decisions, performance monitoring, and iterative improvement.

Principles

Quantify AI performance instead of "shipping on vibes."
Evals require diverse datasets and clear scoring criteria.
Trace visibility is critical for debugging AI agent behavior.

Method

An eval process involves creating datasets, defining tasks, establishing scoring systems (deterministic, LLM-as-judge, human-in-the-loop), and running experiments to compare configurations and track regressions or improvements.
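The four pieces described above (dataset, task, scorer, experiment) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not Braintrust's API: `Case`, `exact_match`, `run_experiment`, and the two `config_*` functions are hypothetical names, and the "tasks" are trivial stand-ins for a prompt plus model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    """One dataset row: an input plus the expected output."""
    input: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Deterministic scorer: 1.0 on an exact match after trimming, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def run_experiment(
    task: Callable[[str], str],
    dataset: list[Case],
    scorer: Callable[[str, str], float] = exact_match,
) -> float:
    """Run the task over every case and return the mean score."""
    if not dataset:
        raise ValueError("dataset must contain at least one case")
    scores = [scorer(task(case.input), case.expected) for case in dataset]
    return sum(scores) / len(scores)


# Hypothetical stand-ins for two prompt/model configurations.
def config_a(text: str) -> str:
    return text.upper()


def config_b(text: str) -> str:
    # Deliberately wrong on longer inputs, to show a failing case.
    return text.upper() if len(text) < 4 else text


dataset = [
    Case("abc", "ABC"),      # golden case
    Case("", ""),            # edge case: empty input
    Case("hello", "HELLO"),  # known failure mode for config_b
]

# One experiment per configuration, scored on the same dataset.
results = {
    "config_a": run_experiment(config_a, dataset),
    "config_b": run_experiment(config_b, dataset),
}
```

Swapping `exact_match` for an LLM-as-judge or human-review scorer changes only the `scorer` argument; the dataset and experiment loop stay the same, which is what makes runs comparable.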

In practice

Sample production logs (10-20%) to create eval datasets.
Use `disallow_tools` and explicit prompts to control agent behavior.
Pass parent span IDs as environment variables for subprocess trace visibility.
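Two of the items above lend themselves to a short sketch: sampling a fixed share of production logs, and handing a parent span ID to a subprocess through its environment. Everything named here is an assumption for illustration: `in_sample`, `run_with_parent_span`, the `PARENT_SPAN_ID` variable name, and the 15% rate are hypothetical, and a real tracing SDK will have its own variable names and span format.

```python
import hashlib
import os
import subprocess
import sys


def in_sample(log_id: str, rate: float = 0.15) -> bool:
    """Deterministically keep roughly `rate` of logs, keyed on a stable ID.

    Hashing the ID (rather than calling random()) means the same log is
    always either in or out of the sample across reruns.
    """
    digest = hashlib.sha256(log_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < rate


def run_with_parent_span(cmd: list[str], parent_span_id: str) -> str:
    """Run a subprocess with the parent span ID exposed in its environment."""
    # PARENT_SPAN_ID is a hypothetical variable name chosen for this sketch.
    env = {**os.environ, "PARENT_SPAN_ID": parent_span_id}
    done = subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)
    return done.stdout.strip()


# Stand-in child process: it only echoes the ID it received. A real agent
# subprocess would attach its own spans to this parent instead.
child = [sys.executable, "-c", "import os; print(os.environ['PARENT_SPAN_ID'])"]
```

Without the inherited ID, spans created inside the subprocess would show up as a separate, disconnected trace, which is the visibility gap the tip addresses.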

Topics

AI Evals
Coding Agents
Agentic Search
Vector Search
LLM Evaluation

Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by MLOps.community.