Build a feedback loop that automatically turns real-world data into evals

2026-06-16 · Source: How I AI · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, quick

Summary

The article describes a method for building "hard evals," demanding evaluations for code-writing models, using database query optimization as the working example. The approach is to identify patterns of slow real-world queries, reproduce those scenarios, and then have a coding agent explore optimization strategies against them. That exploration includes exhaustively testing different open-source column store formats and execution engines and recording the results as a matrix of performance outcomes. The goal is to use the model's code-writing ability to find creative performance improvements, with tests that are grounded in actual user behavior and real system bottlenecks rather than synthetic cases. Because the evals are derived from production data, the loop can keep supplying new, harder tests as new slow queries appear.

Key takeaway

If you are a machine learning engineer optimizing code-generating models, integrate real-world performance data into your evaluation pipelines. By identifying and reproducing actual slow queries, you can create "hard evals" that directly challenge your models to find practical optimizations. Coding agents can then exhaustively explore candidate solutions, such as different column store formats, which should lead to more robust and better-performing model outputs in production.

Key insights

Creating "hard evals" from real-world slow queries gives models concrete, production-derived problems to improve on, instead of synthetic benchmarks.

Principles

Reproduce real-world performance bottlenecks.
Exhaustively test optimization permutations.
Leverage agents for creative problem-solving.

Method

Identify slow query patterns, reproduce them, then use a coding agent to exhaustively test database optimization ideas, including various column store formats and execution engines, to compute performance matrices.
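The performance matrix in this method can be sketched roughly as follows. This is a minimal illustration, not the article's implementation: `build_matrix`, `fastest`, `timed`, and the `measure` callback are hypothetical names, and the format and engine labels are placeholders for whatever column store formats and execution engines you actually test. `measure` is assumed to run the reproduced slow query under one combination and return elapsed seconds.

```python
import itertools
import time
from typing import Callable, Dict, Sequence, Tuple

Combo = Tuple[str, str]  # (column store format, execution engine)


def timed(run: Callable[[], None]) -> float:
    """Return wall-clock seconds taken by one call to run()."""
    start = time.perf_counter()
    run()
    return time.perf_counter() - start


def build_matrix(
    formats: Sequence[str],
    engines: Sequence[str],
    measure: Callable[[str, str], float],
    repeats: int = 3,
) -> Dict[Combo, float]:
    """Measure every format x engine pair, keeping the best of `repeats` runs."""
    matrix: Dict[Combo, float] = {}
    for fmt, engine in itertools.product(formats, engines):
        # Best-of-N reduces noise from caches and background load.
        matrix[(fmt, engine)] = min(measure(fmt, engine) for _ in range(repeats))
    return matrix


def fastest(matrix: Dict[Combo, float]) -> Combo:
    """Return the combination with the lowest measured time."""
    return min(matrix, key=matrix.get)
```

In use, `measure` would wrap `timed` around a function that loads the reproduced dataset in the given format and runs the slow query on the given engine. Keeping the measurement behind a callback means the same loop works whether a person or a coding agent supplies the candidate configurations.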

In practice

Analyze query logs for performance issues.
Implement coding agents for database tuning.
Benchmark diverse column store configurations.
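The first step above, analyzing query logs, might look something like this sketch. It assumes the log is already available as `(sql, duration_ms)` pairs; the function names, the 1000 ms threshold, and the regex-based normalization are illustrative assumptions, and a real pipeline would likely use a proper SQL parser. Each returned pattern is a candidate scenario to reproduce as an eval.

```python
import re
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, List, Tuple


def normalize(sql: str) -> str:
    """Collapse literals so queries with the same shape share one pattern."""
    sql = re.sub(r"'[^']*'", "?", sql)           # string literals
    sql = re.sub(r"\b\d+(\.\d+)?\b", "?", sql)   # numeric literals
    return re.sub(r"\s+", " ", sql).strip().lower()


def slow_patterns(
    log: Iterable[Tuple[str, float]],
    threshold_ms: float = 1000.0,  # assumed cutoff for "slow"
    min_count: int = 2,            # ignore one-off slow queries
) -> List[Dict[str, object]]:
    """Group slow queries by normalized shape, ranked by total impact."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for sql, duration_ms in log:
        if duration_ms >= threshold_ms:
            groups[normalize(sql)].append(duration_ms)
    cases = [
        {"pattern": pattern, "count": len(times), "median_ms": median(times)}
        for pattern, times in groups.items()
        if len(times) >= min_count
    ]
    # Rank by median latency times frequency: slow and common comes first.
    return sorted(cases, key=lambda c: c["median_ms"] * c["count"], reverse=True)
```

Requiring a minimum count keeps the eval set focused on recurring patterns rather than one-off outliers, which matches the source's emphasis on patterns of slow queries.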

Topics

Hard Evals
Code Generation Models
Database Optimization
Query Performance
Column Store Formats
Coding Agents

Best for: AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by How I AI.