A 91% eval pass rate shipped our worst regression. We gate on the delta now.
Summary
A continuous integration (CI) evaluation gate configured with an absolute 90% pass-rate threshold failed to detect a critical regression, leading to a two-day outage. A change with a 91% aggregate pass rate shipped even though the "ambiguous refund requests" slice had fallen from 98% to 74%. The incident showed that an absolute threshold catches only catastrophic failures, not drift or slice-level regressions. The team rewired the gate to compare each run against the per-slice scores of the last green run: it now fails if any slice drops more than 3 points or the aggregate drops more than 1.5 points. To prevent baseline ratcheting, `baseline.json` is updated only from a green `main` branch, with human sign-off.
Key takeaway
For MLOps engineers managing CI/CD pipelines with model evaluation gates, relying solely on an absolute pass-rate threshold leaves regressions undetected. Move to delta-based gating: compare current per-slice and aggregate metrics against a stable baseline from the last successful `main` branch build. This keeps a regression in a specific, possibly small, user segment from being masked by a high overall pass rate.
Key insights
Absolute evaluation thresholds are insufficient for regression detection; delta-based gating against a stable baseline is crucial.
Principles
- Absolute thresholds catch collapses, not drift.
- Aggregates can hide critical slice regressions.
- Baselines must update only on green main.
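The second principle can be checked with simple arithmetic. The sketch below is illustrative only: the article gives the slice pass rates (98% to 74%) but not the case counts or the pre-change aggregate, so the 50-case refund slice, the 1,000-case eval set, and the 92% pass rate on the remaining cases are all assumptions.

```python
# Hypothetical eval set: the case counts and the "everything else" pass rate
# are illustrative assumptions; only the 98% -> 74% slice drop is from the article.
REFUND_CASES = 50
OTHER_CASES = 950
OTHER_PASSED = 874  # assumed 92% of 950, unchanged by the regression


def aggregate_pass_rate(refund_passed: int) -> float:
    """Aggregate pass rate in percent across both slices."""
    total_passed = refund_passed + OTHER_PASSED
    return 100.0 * total_passed / (REFUND_CASES + OTHER_CASES)


before = aggregate_pass_rate(refund_passed=49)  # 49/50 = 98% on the slice
after = aggregate_pass_rate(refund_passed=37)   # 37/50 = 74% on the slice

print(f"aggregate before: {before:.1f}%")
print(f"aggregate after:  {after:.1f}%")
print(f"aggregate drop:   {before - after:.1f} points")
```

Under these assumed sizes the aggregate moves from about 92.3% to about 91.1%, so it stays above an absolute 90% threshold while the slice itself loses 24 points.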
Method
Implement a CI gate comparing current per-slice and aggregate evaluation scores against a `baseline.json` from the last green run. Fail if any slice drops >3 points or aggregate drops >1.5 points.
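A minimal sketch of such a gate follows. The 3-point and 1.5-point limits come from the article; the schema of the score dictionaries, the function name `gate_failures`, and the sample scores (other than the refund-slice numbers and the 91% aggregate) are assumptions made for illustration, not the team's actual implementation.

```python
from typing import Dict, List

# Limits taken from the article; everything else here is an assumption.
MAX_SLICE_DROP = 3.0       # points
MAX_AGGREGATE_DROP = 1.5   # points


def gate_failures(baseline: Dict, current: Dict) -> List[str]:
    """Compare a run against the baseline; an empty list means the gate passes.

    Both arguments use an assumed schema:
    {"aggregate": float, "slices": {slice_name: float}}, scores in percent.
    """
    failures = []

    agg_drop = baseline["aggregate"] - current["aggregate"]
    if agg_drop > MAX_AGGREGATE_DROP:
        failures.append(f"aggregate dropped {agg_drop:.1f} points")

    for name, base_score in baseline["slices"].items():
        if name not in current["slices"]:
            # A slice that vanished from the run cannot be verified, so fail.
            failures.append(f"slice '{name}' missing from current run")
            continue
        drop = base_score - current["slices"][name]
        if drop > MAX_SLICE_DROP:
            failures.append(f"slice '{name}' dropped {drop:.1f} points")

    return failures


# Hypothetical scores: only the refund-slice numbers and the 91% aggregate
# come from the article.
baseline = {"aggregate": 92.0,
            "slices": {"ambiguous refund requests": 98.0, "other": 92.0}}
current = {"aggregate": 91.0,
           "slices": {"ambiguous refund requests": 74.0, "other": 92.0}}

failures = gate_failures(baseline, current)
for message in failures:
    print("FAIL:", message)
```

In a CI job, the baseline would typically be loaded from `baseline.json` with `json.load`, and the job would exit with a non-zero status whenever the returned list is non-empty. Treating a missing slice as a failure is a design choice of this sketch: a slice that silently disappears from the run is otherwise indistinguishable from one that passed.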
In practice
- Check per-slice variance in green runs.
- Verify small slices don't hide in aggregate.
- Ensure baseline updates only on green main.
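The last check, that the baseline moves only on a green `main` build with sign-off, could be enforced in the update step itself. The sketch below assumes a hypothetical helper `maybe_update_baseline` and assumes the CI system can supply the branch name, the gate result, and an approver; none of these names come from the article.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Optional


def maybe_update_baseline(
    path: Path,
    scores: Dict,
    branch: str,
    gate_passed: bool,
    approved_by: Optional[str],
) -> bool:
    """Write scores to the baseline file only when all three conditions hold.

    Returns True if the baseline was updated, False otherwise.
    """
    if branch != "main" or not gate_passed or not approved_by:
        return False
    path.write_text(json.dumps(scores, indent=2, sort_keys=True))
    return True


# Demo in a temporary directory so no real baseline is touched.
with tempfile.TemporaryDirectory() as tmp:
    baseline_path = Path(tmp) / "baseline.json"
    scores = {"aggregate": 92.0, "slices": {"ambiguous refund requests": 98.0}}

    # A green feature branch must not move the baseline.
    updated_on_branch = maybe_update_baseline(
        baseline_path, scores, "feature/x", True, "reviewer")
    # Green main without sign-off must not move it either.
    updated_unsigned = maybe_update_baseline(
        baseline_path, scores, "main", True, None)
    # Green main with sign-off does.
    updated_on_main = maybe_update_baseline(
        baseline_path, scores, "main", True, "reviewer")
    stored = json.loads(baseline_path.read_text())
```

Keeping all three conditions in one function means a pipeline step cannot update the baseline by accident from a pull-request build or without an approver recorded.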
Topics
- CI/CD
- Model Evaluation
- Regression Detection
- MLOps
- Baseline Management
- Per-slice Metrics
Best for: MLOps Engineer, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by LLM on Medium.