PROXIMA: A Reliability Scoring Framework for Proxy Metrics in Online Controlled Experiments

Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

PROXIMA (Proxy Metric Validation Framework for Online Experiments) is a diagnostic framework that scores the reliability of proxy metrics used in large-scale online A/B testing. It targets a specific failure mode: the relationship between a proxy and the true outcome can differ across user segments, so a proxy that looks sound in aggregate can still drive incorrect launch decisions. PROXIMA evaluates proxy reliability along three dimensions: normalized effect correlation, directional accuracy, and segment-level fragility rate. The framework was validated on 80 simulated A/B tests drawn from two public datasets: the Criteo Uplift corpus (14M observations, advertising) and KuaiRec (7K users, video recommendation). Early engagement metrics achieved composite reliabilities of 0.80 on Criteo and 0.62 on KuaiRec, with 98.4% average decision agreement with an oracle policy. The recommendation domain showed much higher segment-level heterogeneity (68% fragility) than advertising (13%), although directional accuracy remained high in both.
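To make the "decision agreement with an oracle policy" figure concrete, here is a minimal Python sketch of how such an agreement rate could be computed. The summary does not spell out either policy, so everything here is an assumption for illustration: the proxy policy ships when the proxy lift is significantly positive (z above 1.96), the oracle ships when the true outcome effect is positive, and the function names and toy numbers are hypothetical.

```python
# Hypothetical sketch: the decision rules, names, and data below are
# illustrative assumptions, not the definitions used by PROXIMA.

def proxy_ships(effect: float, std_error: float, z_crit: float = 1.96) -> bool:
    """Assumed proxy policy: ship when the proxy lift is significantly positive."""
    return std_error > 0 and effect / std_error > z_crit


def oracle_ships(true_outcome_effect: float) -> bool:
    """Assumed oracle policy: ship when the true outcome effect is positive."""
    return true_outcome_effect > 0


def decision_agreement(experiments) -> float:
    """Share of experiments where the proxy decision matches the oracle.

    Each experiment is a tuple:
    (proxy_effect, proxy_std_error, true_outcome_effect).
    """
    matches = [
        proxy_ships(proxy, se) == oracle_ships(outcome)
        for proxy, se, outcome in experiments
    ]
    return sum(matches) / len(matches)


# Toy data, invented for this example.
experiments = [
    (0.30, 0.10, 0.12),    # proxy ships, oracle ships      -> agree
    (0.05, 0.10, 0.02),    # proxy holds, oracle ships      -> disagree
    (-0.20, 0.10, -0.08),  # proxy holds, oracle holds      -> agree
    (0.25, 0.10, -0.01),   # proxy ships, oracle holds      -> disagree
]

agreement = decision_agreement(experiments)
print(f"decision agreement: {agreement:.2f}")
```

Under these assumptions, an agreement rate near 1.0 means that acting on the proxy would rarely have changed the launch decision relative to knowing the true outcome.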

Key takeaway

Research scientists who design or evaluate online controlled experiments should consider using PROXIMA to validate proxy metrics before relying on them. The framework surfaces heterogeneous proxy-outcome relationships across user segments that aggregate correlations can obscure, which supports more reliable ship/no-ship decisions and helps avoid costly launch errors.

Key insights

PROXIMA scores proxy metric reliability in A/B tests by auditing decision accuracy across user segments.

Principles

Method

PROXIMA scores proxy reliability using normalized effect correlation, directional accuracy, and segment-level fragility rate to identify failing user segments.
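The three components named above can be sketched in a few lines of Python. This summary does not give the formulas, so every definition below is an assumption chosen for illustration: effect correlation is a Pearson correlation across experiments clipped to [0, 1], directional accuracy is the share of experiments where proxy and outcome effects share a sign, a segment counts as fragile when its proxy and outcome effects disagree in sign, and the composite is an equally weighted mean. All function names and data are hypothetical.

```python
# Hypothetical sketch: metric definitions, weights, names, and data are
# illustrative assumptions, not PROXIMA's actual formulas.
from math import sqrt


def pearson(xs, ys) -> float:
    """Plain Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def same_sign(a: float, b: float) -> bool:
    """True when both effects are positive or both are non-positive."""
    return (a > 0) == (b > 0)


def directional_accuracy(proxy_effects, outcome_effects) -> float:
    """Share of experiments where proxy and outcome move in the same direction."""
    pairs = list(zip(proxy_effects, outcome_effects))
    return sum(same_sign(p, o) for p, o in pairs) / len(pairs)


def fragile_segments(segment_effects) -> list:
    """Segments whose proxy and outcome effects disagree in sign (assumed rule)."""
    return [seg for seg, (p, o) in segment_effects.items() if not same_sign(p, o)]


def fragility_rate(segment_effects) -> float:
    return len(fragile_segments(segment_effects)) / len(segment_effects)


def composite_reliability(corr: float, dir_acc: float, fragility: float) -> float:
    """Assumed composite: equal-weight mean, with fragility inverted."""
    norm_corr = min(1.0, max(0.0, corr))  # clip correlation to [0, 1]
    return (norm_corr + dir_acc + (1.0 - fragility)) / 3.0


# Toy data, invented for this example: per-experiment treatment effects.
proxy_effects = [0.20, -0.10, 0.30, 0.05]
outcome_effects = [0.10, -0.05, 0.20, -0.02]

# Per-segment (proxy_effect, outcome_effect) for a single experiment.
segment_effects = {
    "new_users": (0.30, 0.10),
    "casual_users": (0.20, -0.05),
    "power_users": (-0.10, -0.20),
}

corr = pearson(proxy_effects, outcome_effects)
dir_acc = directional_accuracy(proxy_effects, outcome_effects)
fragility = fragility_rate(segment_effects)

print("failing segments:", fragile_segments(segment_effects))
print(f"composite: {composite_reliability(corr, dir_acc, fragility):.2f}")
```

The point of the sketch is the separation of concerns: the first two scores summarize agreement across experiments, while the fragility check looks inside an experiment and can flag a segment even when the aggregate proxy effect points the right way.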

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.