Feature selection leads to divergent neurobiological interpretations of brain-based machine learning biomarkers
Summary
A study involving over 12,000 participants across four large-scale neuroimaging datasets (HBN, ABCD, HCPD, PNC) and 13 outcomes demonstrates that univariate feature selection in brain-based machine learning models can lead to incomplete and potentially misleading neurobiological interpretations. Researchers found that features typically discarded by selection methods can achieve significant prediction accuracies, often comparable to those of top-ranked features, across cognitive, developmental, and psychiatric phenotypes. These results hold for both functional connectivity (fMRI) and structural (diffusion tensor imaging) connectomes and are robust in external validation. The findings suggest that focusing solely on the most prominent features oversimplifies the complex, widely distributed neural circuits underlying brain-behavior associations, potentially contributing to reproducibility issues in the field. The study reinforces the importance of considering subtle, brain-wide signals.
Key takeaway
AI scientists and research scientists developing brain-based predictive models should critically re-evaluate their reliance on univariate feature selection. Models built only on top-ranked features may overlook significant, complementary neurobiological signals that offer comparable predictive power and could reveal distinct patient subtypes. Exploring lower-ranked feature sets can give a more comprehensive picture of brain-behavior associations and may identify novel, anatomically accessible targets for intervention, which in turn could improve model generalizability and clinical utility.
Key insights
Discarded brain features can predict phenotypes with accuracy comparable to top-ranked features, yielding divergent neurobiological interpretations.
Principles
- Brain-behavior associations are widely distributed.
- Feature selection can oversimplify neurobiological complexity.
- Multiple neurobiologically distinct models may exist for a given phenotype.
Method
The study used a decile-based feature-ranking paradigm: connectome features were ranked by the strength of their association with a target phenotype and partitioned into ten non-overlapping subsets, and each subset's predictive accuracy was then evaluated separately.
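The decile paradigm described above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the study's pipeline. Several choices are assumptions made only for the example: the use of absolute Pearson correlation as the "association strength", ranking inside each training fold, the 5-fold split, the ridge penalty `alpha=1.0`, and the helper names `decile_partition`, `ridge_fit_predict`, and `evaluate_deciles`.

```python
import numpy as np

def decile_partition(X, y, n_bins=10):
    """Rank features by |Pearson r| with y and split them into n_bins
    non-overlapping subsets (bin 0 holds the strongest associations)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    r = (Xc.T @ yc) / np.where(denom == 0, 1.0, denom)
    order = np.argsort(-np.abs(r))  # strongest association first
    return np.array_split(order, n_bins)

def ridge_fit_predict(X_tr, y_tr, X_te, alpha=1.0):
    """Ridge regression solved in its dual (n x n) form, which is cheap
    when there are far more features than participants."""
    mu_x, mu_y = X_tr.mean(axis=0), y_tr.mean()
    A = X_tr - mu_x
    dual = np.linalg.solve(A @ A.T + alpha * np.eye(len(y_tr)), y_tr - mu_y)
    return (X_te - mu_x) @ (A.T @ dual) + mu_y

def evaluate_deciles(X, y, n_bins=10, n_folds=5, alpha=1.0, seed=0):
    """Cross-validated correlation between predicted and observed y,
    computed separately for each ranked feature subset."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    preds = np.zeros((n_bins, len(y)))
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        # Assumption: features are ranked on the training fold only,
        # so the held-out fold never influences the ranking.
        bins = decile_partition(X[train_idx], y[train_idx], n_bins)
        for b, cols in enumerate(bins):
            preds[b, test_idx] = ridge_fit_predict(
                X[train_idx][:, cols], y[train_idx],
                X[test_idx][:, cols], alpha)
    return [float(np.corrcoef(preds[b], y)[0, 1]) for b in range(n_bins)]

# Synthetic stand-in for connectome data: signal spread over all features.
rng = np.random.default_rng(0)
n, p = 200, 500
X = rng.standard_normal((n, p))
w = rng.standard_normal(p) / np.sqrt(p)
y = X @ w + 0.5 * rng.standard_normal(n)
scores = evaluate_deciles(X, y)  # one accuracy value per decile
```

Comparing `scores` across deciles is the point of the exercise: if lower-ranked subsets still predict the phenotype, an interpretation built only on the top subset is incomplete.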
In practice
- Explore lower-ranked features for alternative therapeutic targets.
- Consider multiple feature sets for identifying patient subtypes.
- Use ridge regression for high-dimensional connectivity data.
Topics
- Feature Selection
- Neurobiological Interpretation
- Brain-Behavior Biomarkers
- Connectome-based Predictive Modeling
- Neuroimaging Data Analysis
Best for: Research Scientist, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine learning : nature.com subject feeds.