Aggregate Models, Not Explanations: Improving Feature Importance Estimation

2026-02-13 · Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

A new theoretical analysis shows that ensembling at the model level, rather than aggregating the explanations of individual models, significantly improves the accuracy of feature importance estimates. The gain comes from reducing excess risk, the leading error term, and is largest for expressive models with slow convergence rates. The study validates these findings on the classical Friedman 1, G-function, and Ishigami benchmarks using Multi-Layer Perceptron (MLP) and Random Forest (RF) models. In a real-world application, UK Biobank proteomic data (n=46,382 participants) were used to predict Body Mass Index (BMI): model-level ensembling of LightGBM models (R^2 of 0.62 +/- 0.001) identified key metabolic proteins such as FABP4, LEP, ADM, IGFBP-1, and IGFBP-2 more accurately than aggregating the importances of individual models.

Key takeaway

Research scientists developing or deploying complex ML models for scientific discovery, especially in biomedical applications, should prioritize model-level ensembling for feature importance estimation. The strategy is particularly effective for LOCO and SAGE, where it directly reduces model bias and yields more accurate and reliable feature rankings and selections, as demonstrated in proteomic signature identification for BMI. Consider bagging to mitigate sampling instability and voting to mitigate algorithmic stochasticity, both of which make the resulting importance estimates more robust.

Key insights

Ensembling models directly improves feature importance estimation by reducing excess risk, especially for complex ML models.

Principles

Model-level ensembling reduces excess risk more effectively than aggregating explanations.
Excess risk is the primary driver of feature importance inaccuracy for complex models.
Model diversity (lower correlation) enhances ensemble benefits.
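
A standard identity for squared loss gives some intuition for the first and third principles. It is a textbook result, not taken from the paper, and the symbols f_k (the k-th model), \bar f (their average), and K (the ensemble size) are notation introduced here for illustration:

```latex
\bar f(x) = \frac{1}{K}\sum_{k=1}^{K} f_k(x), \qquad
\frac{1}{K}\sum_{k=1}^{K}\bigl(f_k(x)-y\bigr)^2
= \bigl(\bar f(x)-y\bigr)^2
+ \frac{1}{K}\sum_{k=1}^{K}\bigl(f_k(x)-\bar f(x)\bigr)^2 .
```

The last term is non-negative, so the averaged model's squared error never exceeds the average error of its members, and the gap grows as the members disagree more. This is only a pointwise sketch for squared loss; the paper's analysis of excess risk may rely on different assumptions.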

Method

The proposed method trains an ensemble of models (e.g., via bagging or voting) and derives feature importance from the aggregated ensemble, rather than averaging importance scores computed on the individual sub-models.
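
The difference between the two aggregation orders can be sketched with LOCO (importance of feature j = test risk without j minus test risk with all features). This is a minimal, dependency-free toy and not the paper's implementation: the synthetic data, the bagged 1-nearest-neighbour learner, and every function name below are assumptions chosen so the example runs with the standard library alone. The study itself used MLP, Random Forest, and LightGBM models.

```python
import random
from statistics import fmean

def make_data(n, rng):
    """Toy regression: y depends on x0 and x1; x2 is pure noise."""
    X = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(n)]
    y = [2.0 * x[0] + x[1] ** 2 + rng.gauss(0, 0.1) for x in X]
    return X, y

def fit_1nn(X, y, features):
    """Return a 1-nearest-neighbour predictor that only sees `features`."""
    def predict(x):
        nearest = min(range(len(X)),
                      key=lambda i: sum((X[i][j] - x[j]) ** 2 for j in features))
        return y[nearest]
    return predict

def mse(preds, targets):
    return fmean((p - t) ** 2 for p, t in zip(preds, targets))

def bagged_predictions(X_tr, y_tr, X_te, features, n_models, seed):
    """Test-set predictions of each model trained on its own bootstrap sample."""
    rng = random.Random(seed)  # same seed => same bootstrap samples per feature set
    n = len(X_tr)
    all_preds = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        model = fit_1nn([X_tr[i] for i in idx], [y_tr[i] for i in idx], features)
        all_preds.append([model(x) for x in X_te])
    return all_preds

def risks(all_preds, y_te):
    """Return (risk of the averaged ensemble, mean risk of the individual models)."""
    ensemble = [fmean(column) for column in zip(*all_preds)]
    return mse(ensemble, y_te), fmean(mse(p, y_te) for p in all_preds)

def loco_both_ways(X_tr, y_tr, X_te, y_te, n_models=8, seed=0):
    """LOCO importances computed at the model level and at the explanation level."""
    full = list(range(len(X_tr[0])))
    ens_full, ind_full = risks(
        bagged_predictions(X_tr, y_tr, X_te, full, n_models, seed), y_te)
    model_level, explanation_level = [], []
    for j in full:
        reduced = [f for f in full if f != j]
        ens_red, ind_red = risks(
            bagged_predictions(X_tr, y_tr, X_te, reduced, n_models, seed), y_te)
        # Model level: aggregate predictions first, then compute one LOCO score.
        model_level.append(ens_red - ens_full)
        # Explanation level: the mean of per-model LOCO scores, which by
        # linearity equals the difference of the mean individual risks.
        explanation_level.append(ind_red - ind_full)
    return model_level, explanation_level, ens_full, ind_full

if __name__ == "__main__":
    rng = random.Random(42)
    X_tr, y_tr = make_data(150, rng)
    X_te, y_te = make_data(100, rng)
    ml, el, ens_full, ind_full = loco_both_ways(X_tr, y_tr, X_te, y_te)
    print("model-level LOCO:      ", [round(v, 3) for v in ml])
    print("explanation-level LOCO:", [round(v, 3) for v in el])
    print("ensemble risk vs mean individual risk:", ens_full, ind_full)
```

The two score lists differ only in where the averaging happens. In this toy the averaged ensemble's risk on the full feature set cannot exceed the mean individual risk, which is the kind of risk reduction the summary attributes to model-level ensembling; whether the resulting importances are closer to the truth depends on the setting.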

In practice

Use model-level ensembling for LOCO and SAGE methods.
Prioritize bagging for sampling instability, voting for algorithmic stochasticity.
Apply to high-dimensional biomedical data for robust biomarker discovery.
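
One way to read the bagging versus voting distinction is sketched below: bagging varies the training sample while holding the learner's randomness fixed, and voting holds the data fixed while varying the learner's seed. That reading, the small SGD linear learner, and all names here are illustrative assumptions rather than the paper's setup.

```python
import random
from statistics import fmean

def predict(w, x):
    """Linear prediction; the last entry of w is the bias."""
    return sum(wj * xj for wj, xj in zip(w, x)) + w[-1]

def fit_sgd(X, y, seed, epochs=5, lr=0.05):
    """Stochastic learner: random initialisation and random sample order."""
    rng = random.Random(seed)
    d = len(X[0])
    w = [rng.gauss(0, 0.1) for _ in range(d + 1)]
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            err = predict(w, X[i]) - y[i]
            for j in range(d):
                w[j] -= lr * err * X[i][j]
            w[d] -= lr * err
    return w

def bagging(X, y, n_models, seed=0):
    """Vary the data, fix the learner's seed: targets sampling instability."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_sgd([X[i] for i in idx], [y[i] for i in idx], seed=0))
    return models

def voting(X, y, n_models):
    """Fix the data, vary the learner's seed: targets algorithmic stochasticity."""
    return [fit_sgd(X, y, seed=s) for s in range(n_models)]

def ensemble_predict(models, x):
    """Model-level aggregation: average the members' predictions."""
    return fmean(predict(w, x) for w in models)

if __name__ == "__main__":
    rng = random.Random(7)
    X = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(100)]
    y = [1.5 * x[0] - 0.5 * x[1] + rng.gauss(0, 0.1) for x in X]
    print("bagging:", ensemble_predict(bagging(X, y, 5), [0.2, -0.4]))
    print("voting: ", ensemble_predict(voting(X, y, 5), [0.2, -0.4]))
```

Either ensemble can then be passed to an importance method such as LOCO in place of a single model.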

Topics

Feature Importance Estimation
Ensemble Learning
Explainable AI
Excess Risk
Biomedical Applications

Best for: Research Scientist, AI Researcher, AI Scientist, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.