PROXIMA: A Reliability Scoring Framework for Proxy Metrics in Online Controlled Experiments
Summary
PROXIMA (Proxy Metric Validation Framework for Online Experiments) is a new diagnostic framework designed to score the reliability of proxy metrics used in large-scale online A/B testing. It addresses the challenge of heterogeneous proxy-outcome relationships across user segments, which can lead to incorrect launch decisions. PROXIMA evaluates proxy reliability across three dimensions: normalized effect correlation, directional accuracy, and segment-level fragility rate. The framework was validated using 80 simulated A/B tests on two public datasets: the Criteo Uplift corpus (14M observations, advertising) and KuaiRec (7K users, video recommendation). Early engagement metrics achieved composite reliabilities of 0.80 on Criteo and 0.62 on KuaiRec, resulting in 98.4% average decision agreement with an oracle policy. Analysis showed recommendation domains have significantly higher segment-level heterogeneity (68% fragility) compared to advertising (13%), though directional accuracy remained high.
Key takeaway
Research scientists who design or evaluate online controlled experiments should consider using PROXIMA to validate proxy metrics before relying on them. The framework surfaces heterogeneous proxy-outcome relationships across user segments that aggregate correlations can obscure, which supports more reliable ship/no-ship decisions. In the reported validation, proxy-based decisions agreed with an oracle policy 98.4% of the time on average.
Key insights
PROXIMA scores proxy metric reliability in A/B tests by auditing decision accuracy across user segments.
Principles
- Aggregate correlation can mask segment-level failures.
- Composite scoring improves proxy reliability discrimination.
Method
PROXIMA scores proxy reliability using normalized effect correlation, directional accuracy, and segment-level fragility rate to identify failing user segments.
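To make the three dimensions concrete, here is a minimal sketch of how they might be computed from per-experiment and per-segment treatment effects. The summary does not give PROXIMA's exact formulas, so everything here is an assumption for illustration: the function names, the rescaling of Pearson correlation to [0, 1], the sign-disagreement definition of fragility, and the equal-weight composite are not taken from the paper.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def sign(v):
    """Return -1, 0, or 1."""
    return (v > 0) - (v < 0)

def effect_correlation(proxy_effects, outcome_effects):
    """ASSUMPTION: 'normalized' is read as rescaling Pearson r from [-1, 1] to [0, 1]."""
    return (pearson(proxy_effects, outcome_effects) + 1) / 2

def directional_accuracy(proxy_effects, outcome_effects):
    """Share of experiments where proxy and outcome effects have the same sign."""
    agree = sum(sign(p) == sign(o) for p, o in zip(proxy_effects, outcome_effects))
    return agree / len(proxy_effects)

def fragility_rate(segment_effects):
    """ASSUMPTION: a segment is 'fragile' when its proxy and outcome effects disagree in sign.

    segment_effects is a list of (proxy_effect, outcome_effect) pairs, one per segment.
    """
    fragile = sum(sign(p) != sign(o) for p, o in segment_effects)
    return fragile / len(segment_effects)

def composite_reliability(ec, da, fr):
    """ASSUMPTION: equal-weight average, with fragility inverted so higher is better."""
    return (ec + da + (1 - fr)) / 3

# Hypothetical effects from four experiments and four segments.
proxy = [0.10, 0.20, -0.10, 0.30]
outcome = [0.05, 0.10, -0.02, 0.20]
segments = [(0.10, 0.05), (0.20, -0.10), (-0.10, -0.20), (0.30, 0.10)]

score = composite_reliability(
    effect_correlation(proxy, outcome),
    directional_accuracy(proxy, outcome),
    fragility_rate(segments),
)
```

Inverting the fragility rate keeps all three components on a "higher is more reliable" scale, so a single composite can be compared across proxies; the actual weighting used by PROXIMA may differ.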
In practice
- Use PROXIMA to audit proxy metrics in A/B tests.
- Identify user segments where proxies fail.
- Compare proxy reliability across domains.
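The segment audit described in the list above can be sketched as follows. This is an illustrative outline only: the data layout, the `lift` and `audit_segments` helpers, and the segment names are hypothetical, and the check uses a simple difference in means with no significance testing, which a real audit would likely need.

```python
from statistics import mean

def lift(treated, control):
    """Difference in means between treatment and control."""
    return mean(treated) - mean(control)

def audit_segments(data):
    """Return segments where the proxy lift and the outcome lift point in opposite directions.

    data maps a segment name to {"proxy": (treated, control), "outcome": (treated, control)}.
    """
    failing = []
    for segment, metrics in data.items():
        proxy_lift = lift(*metrics["proxy"])
        outcome_lift = lift(*metrics["outcome"])
        if proxy_lift * outcome_lift < 0:  # opposite signs
            failing.append(segment)
    return failing

# Hypothetical per-segment observations for one experiment.
example = {
    "new_users": {
        "proxy": ([0.30, 0.34], [0.25, 0.27]),
        "outcome": ([1.2, 1.3], [1.0, 1.1]),
    },
    "power_users": {
        "proxy": ([0.60, 0.64], [0.50, 0.52]),
        "outcome": ([2.0, 2.1], [2.3, 2.4]),
    },
}

failing = audit_segments(example)
print(failing)  # ['power_users']
```

In this toy data the proxy rises in both segments, but the outcome falls for `power_users`, which is the kind of failure an aggregate correlation would hide.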
Topics
- PROXIMA Framework
- Proxy Metrics
- Online Controlled Experiments
- Reliability Scoring
- Segment-level Fragility
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Data Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.