Exploring Statistical Change Point Detection Techniques for Performance Anomaly Detection at Mozilla

2026-06-18 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Software Development & Engineering, Artificial Intelligence & Machine Learning · Depth: Advanced, extended

Summary

An empirical study at Mozilla evaluated 25 change-point detection (CPD) methods and 15 ensemble approaches to improve performance anomaly detection in its Perfherder system. Mozilla's current method, based on the Student's t-test, generates 12.5% false positives and misses approximately 6.8% of regressions. For benchmarking, the researchers built a ground-truth dataset of 174 performance time series, manually annotated by eleven Mozilla engineers. Offline and hybrid CPD methods improved recall, but at a significant cost in precision. Ensemble voting strategies mitigated this trade-off, achieving an 11% improvement in F1-score and more consistent performance. The study validates these findings through a practitioner survey and offers guidance for integrating the better-performing methods into Mozilla's performance engineering workflow.

Key takeaway

MLOps engineers who manage performance monitoring in continuous integration should move beyond a single statistical test such as the Student's t-test. Explore ensemble change-point detection methods, which in this study balanced false positives against missed regressions and improved F1-score. Validate any new detection system against practitioner-annotated data so that results reflect real-world behavior and earn engineers' trust.

Key insights

Evaluating diverse change-point detection methods and ensembles can significantly improve performance anomaly detection accuracy over traditional statistical tests.

Principles

The current t-test-based method yields 12.5% false positives and misses about 6.8% of regressions.
Offline/hybrid CPD improves recall but reduces precision.
Ensemble voting balances recall and precision effectively.
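
To make the first principle concrete, here is a minimal sketch of a sliding-window detector built on the Student's t statistic. It is not Perfherder's implementation: the function names, the window size of 5, and the threshold of 7.0 are illustrative assumptions, and only the Python standard library is used.

```python
from math import sqrt
from statistics import mean, variance


def t_statistic(before, after):
    """Absolute two-sample Student's t statistic with pooled variance."""
    n1, n2 = len(before), len(after)
    pooled = ((n1 - 1) * variance(before) + (n2 - 1) * variance(after)) / (n1 + n2 - 2)
    if pooled == 0:
        # Noise-free windows: any difference in means is treated as unbounded evidence.
        return 0.0 if mean(before) == mean(after) else float("inf")
    return abs(mean(after) - mean(before)) / sqrt(pooled * (1 / n1 + 1 / n2))


def detect_shifts(series, window=5, threshold=7.0):
    """Return indices where the windows before and after differ strongly.

    window and threshold are hypothetical defaults chosen for this sketch.
    """
    candidates = []
    for i in range(window, len(series) - window + 1):
        t = t_statistic(series[i - window:i], series[i:i + window])
        if t >= threshold:
            candidates.append((i, t))

    # Collapse each run of adjacent candidates to the index with the largest t.
    alerts, run = [], []
    for idx, t in candidates:
        if run and idx != run[-1][0] + 1:
            alerts.append(max(run, key=lambda c: c[1])[0])
            run = []
        run.append((idx, t))
    if run:
        alerts.append(max(run, key=lambda c: c[1])[0])
    return alerts


# Synthetic series with a level shift starting at index 8.
series = [10.0, 10.2, 9.9, 10.1, 10.0, 10.1, 9.8, 10.0,
          12.1, 12.0, 11.9, 12.2, 12.0, 12.1, 11.8, 12.0]
shifts = detect_shifts(series)
```

A detector of this kind has two knobs (window and threshold) that trade false positives against misses, which is the tension the principles above describe.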

Method

Construct a practitioner-annotated ground-truth dataset, evaluate diverse CPD methods (offline, online, hybrid), and assess ensemble voting strategies for performance anomaly detection.
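
The benchmarking step can be sketched as scoring a detector's output against annotated change points. This summary does not state the study's exact matching rule, so the tolerance margin and the greedy one-to-one matching below are assumptions made for illustration.

```python
def score_detections(predicted, annotated, margin=2):
    """Return (precision, recall, f1) for predicted vs. annotated change points.

    A prediction counts as a true positive if it lies within `margin` samples
    of an annotated point that no earlier prediction has already claimed.
    The margin value and the greedy matching are assumptions of this sketch.
    """
    unmatched = sorted(predicted)
    true_pos = 0
    for truth in sorted(annotated):
        hit = next((p for p in unmatched if abs(p - truth) <= margin), None)
        if hit is not None:
            unmatched.remove(hit)  # each prediction can match only one annotation
            true_pos += 1

    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(annotated) if annotated else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Two of three predictions fall within the margin of an annotation.
precision, recall, f1 = score_detections([8, 20, 41], [9, 40, 60])
```

Reporting precision and recall alongside F1 keeps the recall-versus-precision trade-off visible instead of hiding it in a single number.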

In practice

Use practitioner-annotated data for robust CPD benchmarking.
Implement ensemble voting for balanced anomaly detection.
Extend existing benchmarking tools with new CPD methods.
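
Ensemble voting can be sketched as follows: several detectors each propose change points, nearby proposals are grouped, and a group is accepted only when enough distinct detectors back it. The study's actual voting strategies are not described in this summary, so the grouping rule, the `min_votes` and `margin` parameters, and the function name are hypothetical.

```python
def vote(detections, min_votes=2, margin=2):
    """Combine per-detector change points by voting.

    detections: one list of indices per detector.
    Proposals whose neighbors are at most `margin` apart form a cluster;
    a cluster is accepted when at least `min_votes` distinct detectors
    contributed to it. Both parameters are assumptions of this sketch.
    """
    tagged = sorted((idx, det) for det, found in enumerate(detections) for idx in found)

    clusters, current = [], []
    for idx, det in tagged:
        if current and idx - current[-1][0] > margin:
            clusters.append(current)
            current = []
        current.append((idx, det))
    if current:
        clusters.append(current)

    accepted = []
    for cluster in clusters:
        voters = {det for _, det in cluster}
        if len(voters) >= min_votes:
            positions = sorted(idx for idx, _ in cluster)
            accepted.append(positions[len(positions) // 2])  # middle proposal
    return accepted


# Three hypothetical detectors; only the first two regions gather two or more votes.
proposals = [[8, 30], [9, 52], [7, 31, 70]]
majority = vote(proposals, min_votes=2)
unanimous = vote(proposals, min_votes=3)
```

Raising `min_votes` favors precision and lowering it favors recall, so the vote count is the lever an ensemble offers for balancing the two.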

Topics

Performance Anomaly Detection
Change Point Detection
Ensemble Methods
Software Performance Regression
Mozilla Perfherder
Empirical Software Engineering

Code references

alan-turing-institute/AnnotateChange

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.