A 91% eval pass rate shipped our worst regression. We gate on the delta now.

2026-06-22 · Source: LLM on Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Data Science & Analytics · Depth: Intermediate, quick

Summary

A continuous integration (CI) evaluation gate with an absolute 90% pass-rate threshold failed to catch a critical regression, leading to a two-day outage. A change with a 91% aggregate pass rate shipped even though the "ambiguous refund requests" slice had fallen from 98% to 74%. The incident showed that absolute thresholds catch collapses, not drift. The team rewired the gate to compare each run against the per-slice scores of the last green run: it now fails if any slice drops more than 3 points or the aggregate drops more than 1.5 points. `baseline.json` updates only on a green `main` branch with human sign-off, to prevent baseline ratcheting.

Key takeaway

For MLOps engineers running CI/CD pipelines with model evaluation gates, an absolute pass-rate threshold on its own leaves a blind spot: a small slice can regress badly while the aggregate stays above the bar. Switch to delta-based gating, comparing current per-slice and aggregate metrics against a stable baseline from the last green `main` build, so that regressions in small user segments are not masked by a high overall score.
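
A rough worked example may help show how a small slice disappears inside the aggregate. The 98% to 74% drop and the 90% threshold come from the article; the slice sizes and the pass rate of the remaining cases are hypothetical numbers picked for illustration, not figures from the original incident.

```python
# Hypothetical eval set. Slice sizes and the "everything_else" pass rate are
# illustrative assumptions; only the 98% -> 74% refund-slice drop and the 90%
# absolute threshold come from the article.
def aggregate(slices):
    """Case-weighted pass rate in percent. slices: {name: (n_cases, pass_pct)}."""
    total = sum(n for n, _ in slices.values())
    return sum(n * pct for n, pct in slices.values()) / total

before = {"ambiguous_refunds": (50, 98.0), "everything_else": (950, 92.0)}
after = {"ambiguous_refunds": (50, 74.0), "everything_else": (950, 92.0)}

agg_before = aggregate(before)
agg_after = aggregate(after)

print(f"aggregate: {agg_before:.1f}% -> {agg_after:.1f}%")
print("absolute gate (>= 90%) passes:", agg_after >= 90.0)
print(f"aggregate drop: {agg_before - agg_after:.1f} points")
```

Under these assumed sizes the aggregate moves from 92.3% to 91.1%, a 1.2-point drop. That clears the 90% absolute threshold and would also clear a 1.5-point aggregate delta check, so only a per-slice comparison would flag the 24-point fall.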

Key insights

Absolute evaluation thresholds are insufficient for regression detection; delta-based gating against a stable baseline is crucial.

Principles

Absolute thresholds catch collapses, not drift.
Aggregates can hide critical slice regressions.
Baselines must update only on green main.
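
The last principle can be sketched as a small guard around the baseline write. This is a minimal illustration, assuming the CI job knows the branch name, the gate result, and who signed off; the function name, the `approved_by` field, and the file layout are hypothetical and not taken from the article.

```python
import json

def maybe_update_baseline(branch, gate_passed, approved_by, scores, path="baseline.json"):
    """Write scores to the baseline file only when all three conditions hold.

    Returns True if the baseline was updated. `approved_by` is a hypothetical
    field holding the reviewer's name; None or "" means no sign-off.
    """
    if branch != "main" or not gate_passed or not approved_by:
        return False
    # Record who approved the update alongside the scores for auditability.
    record = dict(scores, approved_by=approved_by)
    with open(path, "w") as f:
        json.dump(record, f, indent=2, sort_keys=True)
    return True
```

Keeping the write behind all three checks means a passing feature-branch run, or a green `main` run nobody has reviewed, cannot quietly become the new reference point.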

Method

Implement a CI gate comparing current per-slice and aggregate evaluation scores against a `baseline.json` from the last green run. Fail if any slice drops >3 points or aggregate drops >1.5 points.
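
The method above could look roughly like the sketch below. The 3-point and 1.5-point limits are from the article; the JSON shape (`aggregate` plus a `slices` mapping), the function name, the example scores, and the choice to treat a missing slice as a failure are all assumptions made for illustration.

```python
MAX_SLICE_DROP = 3.0      # points, from the article
MAX_AGGREGATE_DROP = 1.5  # points, from the article

def check_gate(baseline, current):
    """Return a list of failure messages; an empty list means the gate passes.

    Both arguments are assumed to look like
    {"aggregate": 92.3, "slices": {"ambiguous_refunds": 98.0, ...}}.
    """
    failures = []
    agg_drop = baseline["aggregate"] - current["aggregate"]
    if agg_drop > MAX_AGGREGATE_DROP:
        failures.append(f"aggregate dropped {agg_drop:.1f} points")
    for name, base_score in baseline["slices"].items():
        if name not in current["slices"]:
            # Assumption: a slice that silently vanishes should block the merge.
            failures.append(f"slice '{name}' missing from current run")
            continue
        drop = base_score - current["slices"][name]
        if drop > MAX_SLICE_DROP:
            failures.append(f"slice '{name}' dropped {drop:.1f} points")
    return failures

# Hypothetical scores; in CI the baseline would be loaded from baseline.json.
baseline = {"aggregate": 92.3, "slices": {"ambiguous_refunds": 98.0, "other": 92.0}}
current = {"aggregate": 91.1, "slices": {"ambiguous_refunds": 74.0, "other": 92.0}}

for message in check_gate(baseline, current):
    print("GATE FAIL:", message)
```

With these example scores the aggregate check passes and only the refund slice is reported. A CI step would typically exit non-zero whenever the returned list is non-empty.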

In practice

Check per-slice variance in green runs.
Verify small slices don't hide in aggregate.
Ensure baseline updates only on green main.
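
For the first check, one possible approach is to measure how much each slice moves between runs that were already green, so a 3-point limit is not tripped by ordinary noise. The sketch below assumes a list of per-slice score dicts from recent green runs; the function name and the multiplier `k=2.0` are arbitrary illustrative choices, not values from the article.

```python
from statistics import pstdev

def noisy_slices(green_runs, max_slice_drop=3.0, k=2.0):
    """Flag slices whose run-to-run spread makes the drop threshold unreliable.

    green_runs: list of {slice_name: score} dicts from recent green runs.
    A slice is flagged when k * (population std dev) >= max_slice_drop, meaning
    ordinary noise alone could plausibly trip the gate.
    """
    flagged = {}
    names = set().union(*(run.keys() for run in green_runs))
    for name in sorted(names):
        scores = [run[name] for run in green_runs if name in run]
        if len(scores) < 2:
            continue  # not enough history to estimate spread
        spread = pstdev(scores)
        if k * spread >= max_slice_drop:
            flagged[name] = round(spread, 2)
    return flagged
```

A flagged slice likely needs more eval cases or a wider tolerance before a fixed per-slice limit is meaningful for it.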

Topics

CI/CD
Model Evaluation
Regression Detection
MLOps
Baseline Management
Per-slice Metrics

Best for: MLOps Engineer, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by LLM on Medium.