How Braintrust uses AI agents, evals, and CI to ship better software | Ankur Goyal

· Source: Lenny's Newsletter · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Cloud Computing & IT Infrastructure · Depth: Advanced, extended

Summary

Ankur Goyal, CEO of Braintrust, describes how his company combines AI agents, rigorous evals, and continuous integration (CI) to improve software development, particularly for complex infrastructure work such as optimizing slow database queries. He argues that AI agents, specifically Codex and GPT models, can test different algorithms and configurations (e.g., column store formats, execution engines) far more exhaustively than human engineers can. Braintrust uses this approach to identify slow query patterns, reproduce them, and automatically experiment with fixes, such as bloom filters, to improve performance. Goyal describes evals as a "modern PRD": they encode success criteria and "taste" (like a designer's aesthetic) as quantifiable metrics, so models can iterate and improve autonomously. The result, he says, is a higher overall quality bar and teams that can take on harder technical problems efficiently.
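
Since the summary names bloom filters as one candidate fix, a minimal sketch of the idea may help: a small per-block filter lets a query skip data blocks that cannot contain the requested key. This is a generic illustration, not Braintrust's implementation; the class, the `build_filters` and `blocks_to_scan` helpers, and the sizing parameters are all hypothetical.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key: str):
        # Derive several bit positions per key by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def build_filters(blocks):
    """Build one filter per data block (hypothetical storage layout)."""
    filters = []
    for block in blocks:
        bloom = BloomFilter()
        for key in block:
            bloom.add(key)
        filters.append(bloom)
    return filters


def blocks_to_scan(filters, key):
    """Return indexes of blocks that might hold the key; the rest are skipped."""
    return [i for i, bloom in enumerate(filters) if bloom.might_contain(key)]
```

A query for a key then reads only the blocks returned by `blocks_to_scan`. Because false positives are possible, the result may include extra blocks, but it never omits the block that actually holds the key.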

Key takeaway

For engineering leaders and staff engineers tackling complex infrastructure or performance challenges, the lesson is to pair AI agents with robust evaluation pipelines. Invest in CI and build feedback loops that convert real-world data into quantifiable evals, so agents can rigorously test solutions such as database optimizations. This lets your team reach higher-quality outcomes and pay down technical debt more efficiently, freeing human engineers for higher-level problem-solving.
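
One way to picture that feedback loop is a small script that turns logged slow queries into eval cases and then acts as a CI gate. The sketch below is an assumption-laden illustration, not Braintrust's pipeline: the log record shape (`query`, `duration_ms`), the 500 ms threshold, the scoring rule, and the `min_mean_score` cutoff are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    query: str
    baseline_ms: float  # slowest latency observed in production logs


def cases_from_logs(log_records, slow_threshold_ms=500.0):
    """Turn production log records into eval cases, one per slow query."""
    worst = {}
    for rec in log_records:
        if rec["duration_ms"] >= slow_threshold_ms:
            query = rec["query"]
            worst[query] = max(worst.get(query, 0.0), rec["duration_ms"])
    return [EvalCase(query, ms) for query, ms in sorted(worst.items())]


def score(case, measured_ms):
    """Fraction of the baseline latency removed, clamped to [0, 1]."""
    improvement = (case.baseline_ms - measured_ms) / case.baseline_ms
    return max(0.0, min(1.0, improvement))


def ci_gate(cases, measure, min_mean_score=0.2):
    """Return (passed, mean_score); `measure` maps a query to latency in ms."""
    scores = [score(case, measure(case.query)) for case in cases]
    mean = sum(scores) / len(scores) if scores else 0.0
    return mean >= min_mean_score, mean
```

In a CI job, `measure` would run each query against the candidate build, and the job would fail whenever `ci_gate` returns `False` as its first element.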

Key insights

Rigorous evals encode "taste" and success criteria as quantifiable metrics, so AI agents can iterate toward high-quality software with less manual review.

Principles

Method

Reproduce slow query patterns, have coding agents (Codex, GPT models) exhaustively test database optimizations (e.g., column store formats), and quantify success with AI-generated scoring functions based on defined criteria.
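
The method above can be sketched as a small experiment harness: run every candidate implementation against a reproduced workload, check correctness, and rank by measured time. The two candidates here (`full_scan`, `sorted_lookup`) are hypothetical stand-ins for the real alternatives an agent might try, such as different column store formats; the harness itself is an illustration, not the tooling described in the article.

```python
import bisect
import time


def full_scan(rows, target):
    """Baseline candidate: examine every row."""
    return [row for row in rows if row == target]


def sorted_lookup(rows, target):
    """Alternative candidate: binary search, assuming rows are sorted."""
    lo = bisect.bisect_left(rows, target)
    hi = bisect.bisect_right(rows, target)
    return rows[lo:hi]


def run_experiments(candidates, rows, target, expected, repeats=5):
    """Time each candidate on the reproduced workload and rank the results."""
    results = []
    for name, fn in candidates.items():
        best = float("inf")
        correct = True
        for _ in range(repeats):
            start = time.perf_counter()
            output = fn(rows, target)
            best = min(best, time.perf_counter() - start)
            correct = correct and output == expected
        results.append({"name": name, "correct": correct, "seconds": best})
    # Incorrect candidates sort last; correct ones are ordered fastest first.
    return sorted(results, key=lambda r: (not r["correct"], r["seconds"]))
```

Checking correctness before ranking matters here: a fast candidate that returns wrong rows should never win, which is why incorrect results sort to the bottom regardless of timing.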

Topics

Best for: AI Architect, Machine Learning Engineer, CTO, AI Engineer, Software Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Lenny's Newsletter.