[D] Risk of using XGB models

2026-03-24 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, short

Summary

A junior data scientist working as a model risk auditor at a non-banking financial institution is seeking counter-arguments to their company's model validation team regarding complex ensemble models. The company uses XGBoost feeder models for customer segments (e.g., bureau thick, thin, and NTC, i.e. new to credit) across loan products (Farm, Two-wheeler, Personal, Consumer Durable). The feeder models' outputs are converted to scores, passed through a sigmoid function to obtain what the post calls logits (strictly, a sigmoid maps a logit to a probability, and the logit is its inverse), and then fed into a final logistic model with static coefficients that predicts the probability of default. The auditor observed that some variables in the feeder models are statistically insignificant or weak predictors (Information Value < 2%), but the validation team dismisses this, arguing that the ensemble's aggregated output is what matters. The auditor wants technical arguments to challenge this rationale, especially since the team neither checks VIF nor uses LIME or SHAP for interpretability.

Key takeaway

When auditing complex ensemble models, challenge the assumption that aggregation negates the risk of weak features. Demonstrate potential overfitting by testing on the most recent data with chronological splits, and assess feature importance with methods such as SAGE rather than VIF, which measures collinearity, not predictive relevance. The audit should also include PSI and KS checks to identify data drift, which gives concrete evidence of model instability.

Key insights

Weak features in ensemble models can lead to overfitting and reduced robustness, despite aggregation.

Principles

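Simpler models tend to be more robust to noise and distribution shift.
Low univariate importance does not preclude multivariate importance: a feature that looks weak on its own can still matter through interactions.
SHAP explains individual predictions; the discussion argues that using it as a global feature-importance measure is theoretically flawed and names SAGE as an alternative.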
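
The point about univariate versus multivariate importance cuts both ways in an audit, so it helps to see it concretely. The sketch below is a synthetic illustration, not the auditor's data: two binary features `a` and `b` (hypothetical names) each carry almost no information about the target on their own, yet together they determine it completely. It uses only the Python standard library.

```python
import math
import random
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p_xy / (p_x * p_y) written in terms of raw counts
        mi += p_xy * math.log2(c * n / (px[x] * py[y]))
    return mi

random.seed(0)
n_rows = 10_000
a = [random.randint(0, 1) for _ in range(n_rows)]
b = [random.randint(0, 1) for _ in range(n_rows)]
# Synthetic target: XOR of the two features, so neither helps alone.
y = [ai ^ bi for ai, bi in zip(a, b)]

mi_a = mutual_information(a, y)                     # close to 0 bits
mi_b = mutual_information(b, y)                     # close to 0 bits
mi_joint = mutual_information(list(zip(a, b)), y)   # close to 1 bit

print(f"MI(a; y)    = {mi_a:.4f} bits")
print(f"MI(b; y)    = {mi_b:.4f} bits")
print(f"MI(a,b; y)  = {mi_joint:.4f} bits")
```

For the audit, this means a low Information Value alone does not prove a variable is useless; the stronger argument is to show that the model's performance does not change, or improves on recent data, when the variable is removed.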

Method

Audit tree-based models by assessing Population Stability Index (PSI) and Kolmogorov-Smirnov (KS) statistics between recent and training populations to detect data drift and population changes.
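
A minimal sketch of both checks, using only the Python standard library and synthetic score distributions. The function names, the choice of ten quantile bins taken from the training sample, and the small `eps` floor for empty bins are assumptions made for illustration, not details from the original discussion.

```python
import math
import random

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index with bin edges at quantiles of `expected`."""
    ordered = sorted(expected)
    # Interior cut points taken from the training (expected) sample.
    edges = [ordered[int(len(ordered) * i / n_bins)] for i in range(1, n_bins)]

    def shares(values):
        counts = [0] * n_bins
        for v in values:
            counts[sum(v > e for e in edges)] += 1  # index of the bin v falls in
        # Floor at eps so an empty bin does not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

random.seed(42)
train = [random.gauss(0.0, 1.0) for _ in range(5000)]           # training population
recent_stable = [random.gauss(0.0, 1.0) for _ in range(5000)]   # no drift
recent_shifted = [random.gauss(0.5, 1.0) for _ in range(5000)]  # mean shifted by 0.5 sd

psi_stable, psi_shifted = psi(train, recent_stable), psi(train, recent_shifted)
ks_stable, ks_shifted = ks_statistic(train, recent_stable), ks_statistic(train, recent_shifted)

print(f"PSI stable={psi_stable:.4f}  shifted={psi_shifted:.4f}")
print(f"KS  stable={ks_stable:.4f}  shifted={ks_shifted:.4f}")
```

A common rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but these cut-offs are conventions rather than statistical tests. The same `ks_statistic` function can be applied per feature or to the final score.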

In practice

Test models on the newest data using chronological splits.
Use the Boruta algorithm or mutual information to check for irrelevant variables.
Construct examples where models fail due to weak features.
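
The chronological-split test can be sketched as follows. Everything here is synthetic and hypothetical: the row fields (`date`, `score`, `default`), the cutoff date, and the way the score's link to default is deliberately weakened after the cutoff to mimic a model that has gone stale. AUC is computed with the rank-sum formulation so the example needs only the standard library.

```python
import random
from datetime import date, timedelta

def chronological_split(rows, cutoff):
    """Rows dated before `cutoff` form the development set, the rest out-of-time."""
    older = [r for r in rows if r["date"] < cutoff]
    newer = [r for r in rows if r["date"] >= cutoff]
    return older, newer

def auc(scores, labels):
    """AUC via the rank-sum (Mann-Whitney) formulation; ties get average ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank for the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

random.seed(7)
start, cutoff = date(2024, 1, 1), date(2024, 7, 1)
rows = []
for _ in range(4000):
    d = start + timedelta(days=random.randrange(366))
    score = random.random()  # stands in for the model's predicted risk
    # Synthetic drift: after the cutoff the score is less tied to default.
    p_default = score if d < cutoff else 0.25 + 0.5 * score
    rows.append({"date": d, "score": score, "default": int(random.random() < p_default)})

dev, oot = chronological_split(rows, cutoff)
auc_dev = auc([r["score"] for r in dev], [r["default"] for r in dev])
auc_oot = auc([r["score"] for r in oot], [r["default"] for r in oot])

print(f"AUC development: {auc_dev:.3f}")
print(f"AUC out-of-time: {auc_oot:.3f}")
```

A random train/test split would blend both periods and hide this gap, which is why the split is made on time. In a real audit the scores would come from the production feeder models and the final logistic model rather than from a random generator.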

Topics

Model Risk Audit
Ensemble Models
XGBoost
Feature Importance
Data Drift

Code references

iancovert/sage

Best for: Data Scientist, AI Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.