LOTTERY: Learning from Reference-Only Samples in Two-Sample Testing under Size Asymmetry

· Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

LOTTERY is a data-adaptive two-sample test for few-shot settings with severe sample-size imbalance: abundant reference samples are available, but only a few query samples exist. Traditional data-splitting paradigms struggle in this regime; LOTTERY instead puts the large reference dataset to work. It learns reference-dependent representations that summarize the salient structure of the reference distribution and so provide informative signals for detecting distributional departures. The method combines diverse representation families that capture both global and local data structure, and weights them adaptively through an uncertainty-guided principle that uses reference samples only. Theoretically, LOTTERY guarantees permutation-based type I error control and is consistent: test power converges to one as sample sizes grow, provided the representation set includes at least one consistent representation. Empirically, it achieves strong performance across various benchmarks while preserving type I error control.
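To make the permutation-based type I error control concrete, here is a minimal sketch of a generic permutation p-value under size imbalance. It is not LOTTERY itself: the function names `permutation_p_value` and `mean_gap` are illustrative, and the mean-difference statistic is a placeholder for whatever statistic a method supplies. The only assumption the validity argument needs is that the statistic is a fixed function of the two samples, so that relabelling pooled samples is fair under the null.

```python
import numpy as np

def permutation_p_value(reference, query, statistic, n_permutations=999, seed=0):
    """Permutation p-value for H0: reference and query share one distribution."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([reference, query], axis=0)
    n_ref = len(reference)
    observed = statistic(reference, query)
    exceed = 0
    for _ in range(n_permutations):
        # Reassign labels at random, keeping the original (imbalanced) group sizes.
        idx = rng.permutation(len(pooled))
        permuted = statistic(pooled[idx[:n_ref]], pooled[idx[n_ref:]])
        exceed += int(permuted >= observed)
    # Counting the observed labelling itself (the +1 terms) keeps the
    # p-value valid at any finite number of permutations.
    return (1 + exceed) / (1 + n_permutations)

def mean_gap(reference, query):
    """Placeholder statistic (an assumption): distance between sample means."""
    return float(np.linalg.norm(reference.mean(axis=0) - query.mean(axis=0)))
```

Rejecting when the returned value is at most the chosen level bounds the false-positive rate by that level whenever the pooled samples are exchangeable under the null, regardless of how unequal the two sample sizes are.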

Key takeaway

Research scientists running two-sample tests under severe sample-size imbalance, especially in few-shot scenarios, should consider LOTTERY. It is an alternative to traditional data splitting that uses abundant reference data to maintain type I error control while achieving strong power. In practice, it offers a theoretically grounded and empirically effective way to detect distributional shifts when query samples are scarce.

Key insights

LOTTERY enables robust two-sample testing in few-shot, imbalanced settings by learning reference-dependent representations from abundant reference data.

Principles

Method

LOTTERY learns reference-dependent representations from abundant reference samples, capturing both global and local structure. It weights these representation families adaptively, using reference samples only and guided by an uncertainty principle, and uses the weighted representations to detect distributional departures.
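The idea of weighting several representation families with reference data alone can be sketched as follows. Everything here is an assumption made for illustration: the two families (a PCA projection as the global view, RBF similarities to reference prototypes as the local view), the weighting rule (inverse of the null spread measured on pseudo-queries drawn from the reference), and all function names are hypothetical stand-ins, not the paper's representations or its uncertainty-guided principle.

```python
import numpy as np

def fit_global(reference, n_components=2):
    """Global view (assumed): projection onto top principal directions."""
    mean = reference.mean(axis=0)
    _, _, vt = np.linalg.svd(reference - mean, full_matrices=False)
    basis = vt[:n_components].T
    return lambda x: (x - mean) @ basis

def fit_local(reference, n_prototypes=8, seed=0):
    """Local view (assumed): RBF similarity to a few reference prototypes."""
    rng = np.random.default_rng(seed)
    protos = reference[rng.choice(len(reference), n_prototypes, replace=False)]
    sq_dist = lambda x: ((x[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    bandwidth = np.median(sq_dist(reference))  # median heuristic
    return lambda x: np.exp(-sq_dist(x) / bandwidth)

def reference_only_weights(reference, representations, query_size,
                           n_draws=200, seed=0):
    """Weight each representation by the inverse spread of its null statistic,
    estimated from pseudo-queries drawn out of the reference set itself."""
    rng = np.random.default_rng(seed)
    spreads = []
    for phi in representations:
        feats = phi(reference)
        centre = feats.mean(axis=0)
        gaps = [
            np.linalg.norm(
                feats[rng.choice(len(feats), query_size, replace=False)].mean(axis=0)
                - centre
            )
            for _ in range(n_draws)
        ]
        spreads.append(np.std(gaps))
    inverse = 1.0 / (np.asarray(spreads) + 1e-12)
    return inverse / inverse.sum()

def make_statistic(representations, weights):
    """Weighted sum of mean-embedding gaps, one term per representation."""
    def statistic(a, b):
        gaps = [np.linalg.norm(phi(a).mean(axis=0) - phi(b).mean(axis=0))
                for phi in representations]
        return float(np.dot(weights, gaps))
    return statistic
```

A hypothetical usage: fit the representations and weights on one slice of the reference set, then evaluate `make_statistic(...)` on a held-out reference slice against the query samples. No query sample is touched during fitting, so the resulting statistic is a fixed function that can be handed to a permutation test. Holding out a reference slice is a simplification chosen for this sketch; the summary does not say how the paper handles it.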

Best for: AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.