Better Experiments with LLM Evals — A funnel, not a fork

· Source: Spotify Engineering · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, short

Summary

LLM evaluations (evals) are automated judges that assess qualitative dimensions like relevance, coherence, and tone at scale, complementing traditional A/B testing. At Spotify, only 12% of A/B tests yield a shipped positive result, though 64% provide valid learning. Schultzberg and Ottens (2024) propose an "evaluation funnel" in which evals precede experiments: evals verify that an output meets quality standards, and experiments validate user response and business outcomes. Evals discard unpromising candidates early, which raises the hit rate of subsequent experiments, and they can generate hypotheses by surfacing unexpected patterns. However, evals cannot measure every dimension: Spotify teams roll back 42% of launched experiments due to regressions in secondary metrics not caught by offline evaluations. LLM judges also introduce a second calibration layer, so their scores need continuous validation against online outcomes to confirm that they track real user experience and business value.

Key takeaway

For MLOps Engineers optimizing LLM-powered applications, integrate LLM evals into an "evaluation funnel" before A/B testing. This approach verifies qualitative improvements early, allowing your experiments to focus on validating real user impact and business outcomes. Continuously calibrate your LLM judges against live A/B test data to ensure their scores accurately reflect user value, preventing costly regressions and building trust in your evaluation system.
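One simple way to check judge calibration is to ask how often the offline judge and the online A/B test agreed on direction across past experiments. The sketch below is illustrative only: the `PastExperiment` record, its field names, and the sample numbers are assumptions made for this example and do not come from the original article.

```python
from dataclasses import dataclass


@dataclass
class PastExperiment:
    """Hypothetical record pairing an offline judge result with an online result."""
    name: str
    judge_delta: float  # treatment minus control mean judge score (offline)
    ab_lift: float      # observed lift in the primary metric (online)


def sign_agreement(experiments: list[PastExperiment]) -> float:
    """Fraction of experiments where the judge and the A/B test agree on direction."""
    if not experiments:
        raise ValueError("need at least one experiment")
    agree = sum(
        1 for e in experiments if (e.judge_delta > 0) == (e.ab_lift > 0)
    )
    return agree / len(experiments)


# Made-up history, purely for illustration.
history = [
    PastExperiment("exp_a", judge_delta=0.30, ab_lift=0.012),
    PastExperiment("exp_b", judge_delta=0.10, ab_lift=-0.004),  # judge was wrong here
    PastExperiment("exp_c", judge_delta=-0.20, ab_lift=-0.010),
    PastExperiment("exp_d", judge_delta=0.25, ab_lift=0.008),
]

rate = sign_agreement(history)
print(f"judge/A-B directional agreement: {rate:.0%}")
```

A low agreement rate would suggest the judge's rubric is drifting away from what users actually respond to. In practice a rank correlation or a significance-aware comparison would likely be more informative than raw sign agreement.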

Key insights

LLM evals verify quality before experiments run, and experiments validate user outcomes; together they form an evaluation funnel in which online results calibrate the offline judges.

Method

Implement an evaluation funnel: run LLM evals to find the most promising treatments, then use A/B tests to validate user response and monitor guardrail metrics. Close the loop by running evals on A/B test data to calibrate the judges.
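The first stage of the funnel can be sketched as a gate that lets only candidates scoring above the control proceed to an A/B test. This is a minimal sketch under stated assumptions: `eval_gate`, `toy_judge`, the `margin` parameter, and the sample outputs are all hypothetical, and `toy_judge` is a trivial stand-in for a real LLM judge call.

```python
from statistics import mean
from typing import Callable


def eval_gate(
    candidates: dict[str, list[str]],
    judge: Callable[[str], float],
    baseline: str,
    margin: float = 0.0,
) -> list[str]:
    """Return the variants whose mean judge score beats the baseline by more than margin."""
    scores = {
        name: mean(judge(output) for output in outputs)
        for name, outputs in candidates.items()
    }
    threshold = scores[baseline] + margin
    passing = [n for n in scores if n != baseline and scores[n] > threshold]
    # Best-scoring variants first, so the A/B test budget goes to the top candidates.
    return sorted(passing, key=lambda n: scores[n], reverse=True)


def toy_judge(output: str) -> float:
    """Stand-in judge (assumption): rewards outputs that explain themselves."""
    return 1.0 if "because" in output else 0.0


# Made-up sample outputs per variant, purely for illustration.
candidates = {
    "control": ["Try this playlist.", "Here is a mix because you like jazz."],
    "variant_a": ["Picked because you replayed it.", "Chosen because it fits your mood."],
    "variant_b": ["Try this.", "New for you."],
}

shortlist = eval_gate(candidates, toy_judge, baseline="control")
print(shortlist)
```

Only the shortlisted variants would move on to an A/B test, where guardrail metrics still need monitoring because the offline gate cannot see regressions in secondary metrics.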

Best for: Machine Learning Engineer, MLOps Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by Spotify Engineering.