An adaptive subsampling method for large-sample feature screening
Summary
Cheng Meng introduces BanditCR-SIS, an adaptive subsampling method for model-free feature screening in large-scale, ultrahigh-dimensional data. The work first proposes CR-SIS, a screening procedure based on Chatterjee's rank correlation, which is more powerful at detecting nonlinear relationships than Pearson correlation-based SIS. To ease the computational burden of screening on large samples, BanditCR-SIS then reformulates the procedure as a multi-armed bandit problem, reducing the cost from O(n log(n) p) for CR-SIS to O(sqrt(n) log(n) p + n log(n)), where n is the sample size and p is the number of features. Both methods are shown to have the sure screening property. Extensive experiments on synthetic and real-world datasets, including CT Slices, show that BanditCR-SIS screens more accurately and uses significantly less CPU time than classical screening methods such as SIS and DC-SIS, particularly under heavy-tailed distributions or complex nonlinear relationships.
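To get a feel for the two cost bounds quoted above, here is a rough worked comparison. The values n = 10^6 and p = 10^4 are illustrative assumptions, not figures from the paper, and big-O notation hides constants, so the ratio is only an order-of-magnitude indication.

```latex
\[
\frac{n \log(n)\, p}{\sqrt{n}\,\log(n)\, p + n \log(n)}
= \frac{n p}{\sqrt{n}\, p + n}
= \frac{10^{6} \cdot 10^{4}}{10^{3} \cdot 10^{4} + 10^{6}}
= \frac{10^{10}}{1.1 \times 10^{7}}
\approx 909 .
\]
```

When p is much larger than sqrt(n), the second term in the denominator becomes negligible and the ratio approaches sqrt(n), which is 1000 in this example.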
Key takeaway
For data scientists working with ultrahigh-dimensional datasets where features relate to the response in complex nonlinear ways, classical screening methods such as SIS or DC-SIS can miss important features or become computationally burdensome. Consider BanditCR-SIS when you need better screening accuracy at a significantly lower computational cost. Its adaptive subsampling and multi-armed bandit formulation keep it robust and efficient, so important features can still be identified when the sample size is large. Adjust the "alpha" parameter to trade off speed against precision for your project.
Key insights
BanditCR-SIS accelerates model-free feature screening by combining a multi-armed bandit formulation with Chatterjee's rank correlation, cutting the computational cost from O(n log(n) p) to O(sqrt(n) log(n) p + n log(n)).
Principles
- Chatterjee's rank correlation robustly detects nonlinear relationships.
- The sure screening property guarantees that all active features are retained with probability tending to one.
- Iterative feature elimination via subsampling boosts efficiency.
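As a concrete illustration of the first principle, the following is a minimal sketch of Chatterjee's rank correlation for data without ties, using only the Python standard library. The function name `chatterjee_xi` is a hypothetical choice for this example, and the code is not taken from the paper; it implements the standard no-ties formula 1 - 3 * sum(|r[i+1] - r[i]|) / (n^2 - 1), where r lists the ranks of y after sorting the pairs by x.

```python
def chatterjee_xi(x, y):
    """Chatterjee's rank correlation of y on x (assumes no ties in x or y)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")

    # Rank every y value from 1 (smallest) to n (largest).
    rank_of = [0] * n
    for r, i in enumerate(sorted(range(n), key=lambda i: y[i]), start=1):
        rank_of[i] = r

    # Read the y-ranks in the order given by sorting the pairs on x.
    ranks_by_x = [rank_of[i] for i in sorted(range(n), key=lambda i: x[i])]

    # Sum the absolute jumps between consecutive ranks.
    jumps = sum(abs(b - a) for a, b in zip(ranks_by_x, ranks_by_x[1:]))
    return 1.0 - 3.0 * jumps / (n * n - 1)
```

The coefficient is asymmetric: `chatterjee_xi(x, y)` measures how well y can be written as a function of x. For a strictly increasing relationship on n points the formula gives exactly 1 - 3/(n + 1), and for a noiseless but non-monotone relationship such as y = x^2 it stays close to 1, whereas the Pearson correlation would be near 0 for x symmetric around zero.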
Method
BanditCR-SIS iteratively discards low-correlation features using adaptive subsampling, treating features as multi-armed bandit "arms" to identify the most important ones efficiently.
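The idea of discarding weak features on growing subsamples can be sketched as a simple successive-elimination loop. This is a simplified illustration under stated assumptions, not the authors' BanditCR-SIS algorithm: the names `successive_screen`, `m0`, `growth`, and `keep_frac` are hypothetical, none of them corresponds to the paper's "alpha", the halving rule is a generic stand-in for the paper's elimination criterion, and the rows are assumed to be in random order already so that the first m rows form a random subsample.

```python
import math


def chatterjee_xi(x, y):
    """Chatterjee's rank correlation of y on x (assumes no ties)."""
    n = len(x)
    rank_of = [0] * n
    for r, i in enumerate(sorted(range(n), key=lambda i: y[i]), start=1):
        rank_of[i] = r
    ranks_by_x = [rank_of[i] for i in sorted(range(n), key=lambda i: x[i])]
    jumps = sum(abs(b - a) for a, b in zip(ranks_by_x, ranks_by_x[1:]))
    return 1.0 - 3.0 * jumps / (n * n - 1)


def successive_screen(X_cols, y, d, m0=100, growth=2.0, keep_frac=0.5):
    """Keep d features by repeatedly dropping the lowest-scoring ones.

    X_cols is a list of feature columns, each a list of length len(y).
    Rows are assumed to be pre-shuffled (hypothetical helper, see lead-in).
    """
    n = len(y)
    if not 1 <= d <= len(X_cols):
        raise ValueError("d must be between 1 and the number of features")
    if not (0.0 < keep_frac < 1.0 and growth > 1.0 and m0 >= 2):
        raise ValueError("need 0 < keep_frac < 1, growth > 1, m0 >= 2")

    alive = list(range(len(X_cols)))  # indices of surviving features ("arms")
    m = min(m0, n)                    # current subsample size
    while len(alive) > d:
        y_sub = y[:m]
        # Score only the surviving features, on the first m rows.
        scores = {j: chatterjee_xi(X_cols[j][:m], y_sub) for j in alive}
        # Once the full sample is in use, cut straight down to d features.
        if m == n:
            n_keep = d
        else:
            n_keep = max(d, math.ceil(keep_frac * len(alive)))
        alive = sorted(alive, key=scores.get, reverse=True)[:n_keep]
        m = min(n, math.ceil(growth * m))  # enlarge the subsample
    return sorted(alive)
```

The saving comes from the fact that most features are scored only on small subsamples; only the few survivors are ever evaluated on large ones.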
In practice
- Tune "alpha" to balance screening speed and accuracy.
- Shuffle data once for efficient subsample selection.
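The shuffling tip can be illustrated with a short sketch: permute the rows a single time, after which the first m rows are a uniform random subsample of size m and larger subsamples simply extend smaller ones. The helper name `shuffle_rows_once` and its `seed` parameter are assumptions made for this example, not part of the paper.

```python
import random


def shuffle_rows_once(X_rows, y, seed=0):
    """Return copies of X_rows and y reordered by one shared random permutation."""
    if len(X_rows) != len(y):
        raise ValueError("X_rows and y must have the same number of rows")
    perm = list(range(len(y)))
    random.Random(seed).shuffle(perm)
    # Apply the same permutation to features and response so rows stay aligned.
    return [X_rows[i] for i in perm], [y[i] for i in perm]


# After one shuffle, every subsample is just a prefix slice.
X_rows = [[i, 10 * i] for i in range(8)]
y = [100 * i for i in range(8)]
X_shuf, y_shuf = shuffle_rows_once(X_rows, y)
small = (X_shuf[:2], y_shuf[:2])  # subsample of size 2
large = (X_shuf[:4], y_shuf[:4])  # subsample of size 4, contains the smaller one
```

Because the subsamples are nested prefixes, no fresh random index set has to be drawn at each screening round.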
Topics
- Feature Screening
- Multi-Armed Bandits
- Chatterjee's Rank Correlation
- Ultrahigh-Dimensional Data
- Computational Efficiency
- Nonlinear Relationships
Best for: Research Scientist, AI Scientist, Data Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.