Better Experiments with LLM Evals — A funnel, not a fork

2026-05-18 · Source: Spotify Engineering · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, short

Summary

LLM evaluations (evals) are automated judges that assess qualitative dimensions such as relevance, coherence, and tone at scale, complementing traditional A/B testing. At Spotify, only 12% of A/B tests yield a shipped positive result, though 64% provide valid learning. Schultzberg and Ottens (2024) propose an "evaluation funnel" in which evals precede experiments: evals verify whether an output meets quality standards, while experiments validate user response and business outcomes. Evals help discard unpromising candidates, which raises the hit rate of subsequent experiments, and they can generate hypotheses by surfacing unexpected patterns. However, evals cannot measure every dimension: Spotify teams roll back 42% of launched experiments because of regressions in secondary metrics that offline evaluations did not catch. LLM judges also add a second calibration layer, since their scores must be validated continuously against online outcomes to confirm that they track real user experience and business value.

Key takeaway

For MLOps Engineers optimizing LLM-powered applications, integrate LLM evals into an "evaluation funnel" before A/B testing. This approach verifies qualitative improvements early, allowing your experiments to focus on validating real user impact and business outcomes. Continuously calibrate your LLM judges against live A/B test data to ensure their scores accurately reflect user value, preventing costly regressions and building trust in your evaluation system.

Key insights

LLM evals verify output quality first and experiments then validate user outcomes; together they form a funnel that stays trustworthy only if the evals are calibrated against online results.

Principles

Evals verify quality; experiments validate user impact.
Calibrate eval scores against online outcomes.
Guardrail metrics prevent regressions in secondary dimensions.
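
One way to make the calibration principle concrete is to check whether treatments the judge scored higher also performed better online. The sketch below computes a Spearman rank correlation between judge score changes and observed metric lifts. It is an illustration under assumptions, not the method described in the source article: the data values, variable names, and the choice of Spearman correlation are all hypothetical.

```python
from statistics import mean

def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data: one entry per past A/B test.
judge_score_delta = [0.10, 0.35, -0.05, 0.20, 0.50]   # treatment minus control judge score
online_metric_lift = [0.2, 0.9, -0.1, 0.1, 1.4]       # observed lift in the online metric (%)

rho = spearman(judge_score_delta, online_metric_lift)
print(f"Spearman rho = {rho:.2f}")  # prints "Spearman rho = 0.90"
```

What value of rho counts as "calibrated" is a judgment call for each team; a low or falling correlation would suggest that the judge's rubric or prompt needs revisiting before its scores are used to filter candidates.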

Method

Implement an evaluation funnel: run LLM evals to identify the most promising treatments, then use A/B tests to validate user response and monitor guardrail metrics. Close the loop by running the same evals on A/B test data to calibrate the judges.
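
The first stage of the funnel can be sketched as a simple filter: only candidates whose eval score beats the baseline by some margin go on to an A/B test. This is a minimal sketch under assumptions; the `Candidate` type, the `select_for_experiment` function, the 0-to-1 score scale, and the thresholds are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    eval_score: float  # mean LLM-judge score, assumed here to lie in [0, 1]

def select_for_experiment(candidates, baseline_score, min_gain=0.0, max_arms=2):
    """Keep candidates that beat the baseline by more than min_gain,
    then return the top max_arms by eval score for A/B testing."""
    passing = [c for c in candidates if c.eval_score - baseline_score > min_gain]
    passing.sort(key=lambda c: c.eval_score, reverse=True)
    return passing[:max_arms]

# Hypothetical candidates scored by an LLM judge offline.
candidates = [
    Candidate("prompt_a", 0.69),
    Candidate("prompt_b", 0.75),
    Candidate("prompt_c", 0.81),
    Candidate("prompt_d", 0.71),
]

shortlist = select_for_experiment(candidates, baseline_score=0.70, min_gain=0.02)
print([c.name for c in shortlist])  # prints "['prompt_c', 'prompt_b']"
```

Capping the number of arms reflects that experiment traffic is limited, so the offline stage is where most candidates should be discarded.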

In practice

Use LLM judges to flag trust-breaking content.
Apply evals to A/B test data for calibration.
Monitor secondary metrics with guardrails.
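
A guardrail check on a secondary metric can be framed as a non-inferiority test: the treatment passes only if the data rule out a drop larger than a tolerated margin. The sketch below uses a normal approximation for a higher-is-better metric. The function name, margin, significance level, and sample data are hypothetical, and the source article does not specify which test to use.

```python
from statistics import NormalDist, mean, variance

def guardrail_ok(control, treatment, margin, alpha=0.05):
    """One-sided non-inferiority z-test for a higher-is-better metric.

    H0: mean(treatment) <= mean(control) - margin  (unacceptable regression)
    Returns True when H0 is rejected at level alpha, i.e. the guardrail holds.
    """
    diff = mean(treatment) - mean(control)
    std_err = (variance(treatment) / len(treatment)
               + variance(control) / len(control)) ** 0.5
    z = (diff + margin) / std_err
    p_value = 1 - NormalDist().cdf(z)
    return p_value < alpha

# Hypothetical per-user values of a secondary metric (for example, sessions per day).
control = [1.0, 1.2, 0.8, 1.1, 0.9] * 20
small_drop = [v - 0.01 for v in control]   # tiny regression, within the margin
large_drop = [v - 0.10 for v in control]   # regression beyond the margin

print(guardrail_ok(control, small_drop, margin=0.05))  # prints "True"
print(guardrail_ok(control, large_drop, margin=0.05))  # prints "False"
```

The margin encodes how much regression the team is willing to tolerate on that metric, so it should be agreed before the experiment starts rather than tuned afterwards.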

Topics

LLM Evals
A/B Testing
Experimentation Funnel
Model Calibration
User Experience
MLOps

Best for: Machine Learning Engineer, MLOps Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by Spotify Engineering.