How Braintrust uses AI agents, evals, and CI to ship better software | Ankur Goyal
Summary
Ankur Goyal, CEO of Braintrust, describes how his company uses AI agents, rigorous evals, and continuous integration (CI) to improve software development, particularly for complex infrastructure tasks such as optimizing slow database queries. He argues that AI agents, specifically Codex and GPT models, can exhaustively test different algorithms and configurations (e.g., column store formats, execution engines) far more rigorously than human engineers. Braintrust applies this approach to identify slow query patterns, reproduce them, and automatically experiment with solutions such as bloom filters to improve performance. Goyal emphasizes that evals serve as a "modern PRD": they encode success criteria and "taste" (like a designer's aesthetic) into quantifiable metrics, so models can iterate and improve autonomously. The result is a higher overall quality bar and teams that can tackle harder technical problems efficiently.
Key takeaway
For engineering leaders and staff engineers tackling complex infrastructure or performance challenges, integrating AI agents and robust evaluation pipelines is crucial. You should invest in CI and build feedback loops that convert real-world data into quantifiable evals, allowing agents to rigorously test solutions like database optimizations. This approach enables your team to achieve higher quality outcomes and address technical debt more efficiently, freeing human engineers for higher-level problem-solving.
Key insights
AI agents and rigorous evals encode "taste" and success criteria, enabling automated, high-quality software development.
Principles
- Evals are the modern version of a PRD.
- There is no excuse for lack of rigor or performance.
- The "agent line" for automation keeps rising.
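To make the "evals as a PRD" principle concrete, here is a minimal sketch of success criteria written as weighted, machine-checkable rules. Every name, threshold, and result field here (`Criterion`, `CRITERIA`, `p95_ms`, `peak_mb`, the 200 ms and 512 MB limits) is a hypothetical illustration, not something taken from Braintrust's product or from the original article.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    """One requirement, stated as a check an agent or CI job can run."""
    name: str
    weight: int
    check: Callable[[dict], bool]  # takes a run result, returns pass/fail


# Hypothetical criteria for a query-optimization task.
CRITERIA = [
    Criterion("results match baseline", 5, lambda r: r["rows"] == r["expected_rows"]),
    Criterion("p95 latency under 200 ms", 3, lambda r: r["p95_ms"] < 200),
    Criterion("peak memory under 512 MB", 2, lambda r: r["peak_mb"] < 512),
]


def score(result: dict) -> float:
    """Return the weighted fraction of criteria that the run satisfies."""
    total = sum(c.weight for c in CRITERIA)
    earned = sum(c.weight for c in CRITERIA if c.check(result))
    return earned / total


def failed(result: dict) -> list[str]:
    """List the names of unmet criteria, useful as feedback for the next attempt."""
    return [c.name for c in CRITERIA if not c.check(result)]
```

A CI job could fail the build when `score` drops below an agreed threshold, while `failed` gives an agent something specific to iterate on.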
Method
Reproduce slow query patterns, use coding agents (Codex, GPT models) to exhaustively test database optimizations (e.g., column store formats), and quantify success using AI-generated scoring functions based on defined criteria.
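The method can be sketched as a small, deterministic experiment: reproduce a slow pattern (point lookups that scan every storage segment), try one candidate optimization (a per-segment bloom filter), and report correctness plus a cost metric. This is an illustrative toy under stated assumptions, not Braintrust's implementation. The segment layout, the `rows scanned` cost model, and all function names are hypothetical.

```python
import hashlib


class BloomFilter:
    """Tiny bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bitset

    def _positions(self, key: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key: str) -> bool:
        return all((self.bits >> p) & 1 for p in self._positions(key))


def build_segments(num_segments: int = 20, rows_per_segment: int = 50):
    """Hypothetical data layout: each segment holds a disjoint set of keys."""
    return [[f"user-{s}-{r}" for r in range(rows_per_segment)]
            for s in range(num_segments)]


def scan_all(segments, key):
    """Baseline: the reproduced slow pattern reads every row of every segment."""
    scanned, found = 0, False
    for seg in segments:
        scanned += len(seg)
        found = found or key in seg
    return found, scanned


def scan_with_filters(segments, filters, key):
    """Candidate: skip segments whose bloom filter rules the key out."""
    scanned, found = 0, False
    for seg, bf in zip(segments, filters):
        if not bf.might_contain(key):
            continue
        scanned += len(seg)
        found = found or key in seg
    return found, scanned


def evaluate(keys):
    """Run both strategies on the same lookups and report the comparison."""
    segments = build_segments()
    filters = []
    for seg in segments:
        bf = BloomFilter()
        for k in seg:
            bf.add(k)
        filters.append(bf)
    correct, baseline_rows, candidate_rows = True, 0, 0
    for key in keys:
        b_found, b_scanned = scan_all(segments, key)
        c_found, c_scanned = scan_with_filters(segments, filters, key)
        correct = correct and (b_found == c_found)
        baseline_rows += b_scanned
        candidate_rows += c_scanned
    return {"correct": correct,
            "baseline_rows": baseline_rows,
            "candidate_rows": candidate_rows}
```

Calling `evaluate(["user-3-7", "user-999-0"])` compares one hit and one miss. In a real setup the cost metric would come from measured latency on reproduced production queries, and an agent would run many such candidates rather than one.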
In practice
- Use Codex for hard, long-tail technical problems.
- Build internal background agents for complex infrastructure.
- Invest in cloud development environments for heavy compute.
Topics
- AI Agents
- Evals
- Continuous Integration
- Database Optimization
- Software Engineering Workflow
- Codex
Best for: AI Architect, Machine Learning Engineer, CTO, AI Engineer, Software Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Lenny's Newsletter.