Kernel Two-Sample Testing via Directional Components Analysis

2026-06-16 · Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

The paper proposes a kernel-based two-sample test built on directional components analysis. It uses the spectral decomposition of the maximum mean discrepancy (MMD) statistic to identify well-estimated directional components in the reproducing kernel Hilbert space (RKHS) and bases the test on them. The authors report higher power and improved robustness, particularly in high-dimensional and unbalanced-sample settings. The method aggregates information across multiple kernels and approximates critical values with a multiplier bootstrap, which is reported to be significantly faster than permutation-based alternatives. In simulations and empirical studies on microarray datasets, the method maintains the nominal Type I error rate and delivers higher power than existing MMD-based tests.

Key takeaway

For research scientists testing for distributional differences in high-dimensional or unbalanced datasets, this kernel-based two-sample test is reported to offer higher power and robustness than existing MMD-based tests. Base the test on well-estimated directional components, and consider aggregating multiple kernels to strengthen inference. The multiplier bootstrap is a faster alternative to permutation for approximating critical values.

Key insights

Focusing on well-estimated directional components in RKHS improves kernel two-sample test power and robustness.

Principles

Estimation quality varies significantly across the spectral components of the MMD.
Leading eigen-directions are more reliably estimated in finite samples.
Aggregating information across multiple kernels enhances performance.
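
For orientation, the standard identities behind these principles can be written out as follows. The notation is assumed here rather than taken from the paper: k is the kernel, mu_P and mu_Q are kernel mean embeddings, (lambda_j, phi_j) are the eigenvalues and eigenfunctions of k under a reference measure, and r is a truncation rank. The paper's own decomposition and its rule for deciding which components count as well estimated may differ.

```latex
% Squared MMD as a distance between kernel mean embeddings
\mathrm{MMD}^2(P,Q)
  = \lVert \mu_P - \mu_Q \rVert_{\mathcal{H}}^2
  = \mathbb{E}\,k(X,X') + \mathbb{E}\,k(Y,Y') - 2\,\mathbb{E}\,k(X,Y)

% Mercer expansion of the kernel, with \lambda_1 \ge \lambda_2 \ge \dots \ge 0
k(x,y) = \sum_{j \ge 1} \lambda_j\, \phi_j(x)\, \phi_j(y)

% Resulting decomposition of the MMD into directional components
\mathrm{MMD}^2(P,Q)
  = \sum_{j \ge 1} \lambda_j \bigl( \mathbb{E}_P\,\phi_j(X) - \mathbb{E}_Q\,\phi_j(Y) \bigr)^2

% A statistic restricted to the r leading directions keeps only the terms j \le r
\mathrm{MMD}_r^2(P,Q)
  = \sum_{j=1}^{r} \lambda_j \bigl( \mathbb{E}_P\,\phi_j(X) - \mathbb{E}_Q\,\phi_j(Y) \bigr)^2
```

Each term is the squared mean difference along one eigen-direction, weighted by its eigenvalue, so dropping poorly estimated directions removes their noise from the statistic at the cost of ignoring any signal they carry.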

Method

The method identifies well-estimated directional components via MMD spectral decomposition, aggregates information from multiple kernels, and uses a multiplier bootstrap for critical values.
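
A minimal sketch of this pipeline for a single kernel, under stated assumptions, may help fix ideas. It is not the authors' implementation: the function names (`gaussian_kernel`, `directional_scores`, `truncated_mmd_test`), the Gaussian kernel, the hand-picked rank `r`, and the Gaussian multipliers are all illustrative choices, and the eigen-directions are held fixed during the bootstrap. The paper's rule for selecting well-estimated components is not described in this summary, so a fixed `r` stands in for it.

```python
import numpy as np

def gaussian_kernel(Z, bandwidth):
    """Gaussian kernel matrix for the rows of Z."""
    sq = np.sum(Z**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)
    return np.exp(-d2 / (2.0 * bandwidth**2))

def directional_scores(K, r):
    """Scores of each pooled point on the r leading eigen-directions.

    The returned matrix S (N x r) satisfies S @ S.T == centered K when r == N.
    """
    N = K.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N        # centering matrix
    evals, evecs = np.linalg.eigh(H @ K @ H)   # eigenvalues in ascending order
    top = np.argsort(evals)[::-1][:r]          # indices of the r largest
    lam = np.clip(evals[top], 0.0, None)       # guard against tiny negatives
    return evecs[:, top] * np.sqrt(lam)

def truncated_mmd_test(X, Y, r=5, bandwidth=1.0, n_boot=500, seed=0):
    """Rank-r MMD-type statistic with a Gaussian multiplier bootstrap p-value."""
    rng = np.random.default_rng(seed)
    n, m = len(X), len(Y)
    K = gaussian_kernel(np.vstack([X, Y]), bandwidth)
    S = directional_scores(K, r)
    Sx, Sy = S[:n], S[n:]
    scale = n * m / (n + m)

    # Observed statistic: scaled squared mean difference over kept directions.
    diff = Sx.mean(axis=0) - Sy.mean(axis=0)
    stat = scale * np.sum(diff**2)

    # Multiplier bootstrap: perturb within-sample centered scores with
    # i.i.d. N(0, 1) weights; no kernel matrix is recomputed per draw.
    Cx = Sx - Sx.mean(axis=0)
    Cy = Sy - Sy.mean(axis=0)
    ex = rng.standard_normal((n_boot, n))
    ey = rng.standard_normal((n_boot, m))
    boot_diff = ex @ Cx / n - ey @ Cy / m      # shape (n_boot, r)
    boot = scale * np.sum(boot_diff**2, axis=1)

    p_value = (1 + np.sum(boot >= stat)) / (n_boot + 1)
    return stat, p_value
```

With `r` equal to the pooled sample size, the statistic reduces to the usual biased (V-statistic) MMD estimate times nm/(n+m), which is a convenient sanity check. Each bootstrap draw costs only two small matrix products on precomputed scores, which is the kind of saving over permutation that the summary refers to.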

In practice

Apply the test in high-dimensional and unbalanced data settings.
Use the provided GitHub code for implementation.
Consider multiple kernels for broader component capture.
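
One simple way to combine several kernels, sketched here as an assumption rather than as the paper's aggregation rule, is to concatenate the leading directional scores from Gaussian kernels at different bandwidths and to reuse the same multipliers across all of them, so the bootstrap reflects the dependence between kernels. The names `kernel_scores` and `multi_kernel_test`, the bandwidth grid, and the per-kernel rank `r` are hypothetical.

```python
import numpy as np

def kernel_scores(Z, bandwidth, r):
    """Top-r directional scores of the centered Gaussian kernel matrix."""
    sq = np.sum(Z**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)
    K = np.exp(-d2 / (2.0 * bandwidth**2))
    # Double centering, equivalent to H K H with H = I - 11^T / N.
    Kc = (K - K.mean(axis=0, keepdims=True)
            - K.mean(axis=1, keepdims=True) + K.mean())
    evals, evecs = np.linalg.eigh(Kc)
    top = np.argsort(evals)[::-1][:r]
    return evecs[:, top] * np.sqrt(np.clip(evals[top], 0.0, None))

def multi_kernel_test(X, Y, bandwidths=(0.5, 1.0, 2.0), r=3,
                      n_boot=500, seed=0):
    """Sum of rank-r statistics over several kernels, one shared bootstrap."""
    rng = np.random.default_rng(seed)
    n, m = len(X), len(Y)
    Z = np.vstack([X, Y])

    # Stack the scores from every kernel side by side: N x (r * n_kernels).
    S = np.hstack([kernel_scores(Z, h, r) for h in bandwidths])
    Sx, Sy = S[:n], S[n:]
    scale = n * m / (n + m)

    diff = Sx.mean(axis=0) - Sy.mean(axis=0)
    stat = scale * np.sum(diff**2)

    # The same multipliers act on all kernels' scores at once.
    Cx = Sx - Sx.mean(axis=0)
    Cy = Sy - Sy.mean(axis=0)
    ex = rng.standard_normal((n_boot, n))
    ey = rng.standard_normal((n_boot, m))
    boot_diff = ex @ Cx / n - ey @ Cy / m
    boot = scale * np.sum(boot_diff**2, axis=1)

    p_value = (1 + np.sum(boot >= stat)) / (n_boot + 1)
    return stat, p_value

# Unbalanced example: 80 versus 20 observations in 5 dimensions.
rng = np.random.default_rng(2)
X = rng.standard_normal((80, 5))
Y = rng.standard_normal((20, 5)) + 1.5
stat, p_value = multi_kernel_test(X, Y, bandwidths=(1.0, 2.0, 4.0))
```

Summing across kernels is workable here because every Gaussian kernel has k(x, x) = 1, so the per-kernel statistics are on comparable scales. Other kernel families would likely need normalization, or a max-type combination, before aggregating.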

Topics

Kernel Two-Sample Test
Maximum Mean Discrepancy
Spectral Decomposition
Multiplier Bootstrap
Reproducing Kernel Hilbert Space
High-Dimensional Data

Best for: AI Scientist, Research Scientist, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.