[D] Risk of using XGB models
Summary
A junior data scientist, working as a model risk auditor at a non-banking financial institution, is seeking counter-arguments against their company's model validation team regarding complex ensemble models. The company uses XGBoost feeder models for customer segmentation (e.g., bureau thick/thin/NTC) in loan applications (Farm, Two-wheeler, Personal, Consumer Durable). These feeder models' outputs are converted to scores, passed through a sigmoid function to obtain logits, and then fed into a final logistic model with static coefficients to predict the probability of default. The auditor observed that some variables in the feeder models are statistically insignificant or weak predictors (Information Value < 2%), but the validation team dismisses this, citing the ensemble's aggregated output. The auditor is looking for technical arguments to challenge this rationale, especially since the team does not check for VIF or use LIME/SHAP for interpretability.
Key takeaway
Data scientists auditing complex ensemble models should challenge the assumption that aggregation negates the risk of weak features. Demonstrate potential overfitting by testing on the most recent data with chronological splits, and assess feature importance with methods such as SAGE rather than relying on VIF. The audit should also include PSI and KS checks to detect data drift, which gives concrete evidence of model instability.
Key insights
Weak features in ensemble models can lead to overfitting and reduced robustness, despite aggregation.
Principles
- Simpler models tend to be more robust to noise and distribution changes.
- Low univariate importance does not preclude multivariate importance.
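- SHAP attributes individual predictions, and its use as a global feature-importance measure is criticized as theoretically flawed; SAGE, which is built for global importance, is an alternative.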
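The second principle above can be illustrated with a small synthetic case. This is a minimal sketch, not data from the post: the two binary features, the noisy XOR target, and the `information_value` helper are all assumptions made for illustration. Each feature on its own has an Information Value of zero, yet the pair together is strongly predictive, which is the kind of interaction a tree ensemble can pick up.

```python
import math
from collections import Counter

def information_value(feature, target):
    """IV = sum over levels of (share_good - share_bad) * ln(share_good / share_bad)."""
    good = Counter(f for f, t in zip(feature, target) if t == 0)
    bad = Counter(f for f, t in zip(feature, target) if t == 1)
    n_good, n_bad = sum(good.values()), sum(bad.values())
    iv = 0.0
    for level in set(feature):
        # A floor of 0.5 avoids log(0) when a level has no goods or no bads.
        p_good = max(good[level], 0.5) / n_good
        p_bad = max(bad[level], 0.5) / n_bad
        iv += (p_good - p_bad) * math.log(p_good / p_bad)
    return iv

# Hypothetical data: target is x1 XOR x2, with 10% of labels flipped in each cell.
rows = []
for a in (0, 1):
    for b in (0, 1):
        label = a ^ b
        rows += [(a, b, label)] * 90 + [(a, b, 1 - label)] * 10

x1 = [r[0] for r in rows]
x2 = [r[1] for r in rows]
y = [r[2] for r in rows]
joint = list(zip(x1, x2))  # the two features treated as one combined variable

iv_x1 = information_value(x1, y)
iv_x2 = information_value(x2, y)
iv_joint = information_value(joint, y)
print(f"IV(x1)={iv_x1:.4f}  IV(x2)={iv_x2:.4f}  IV(x1, x2)={iv_joint:.4f}")
```

The same reasoning cuts both ways in an audit: a low univariate IV does not prove a variable is useless, but it also does not prove that the ensemble is using it sensibly, so the claim still has to be tested on held-out data.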
Method
Audit tree-based models by assessing Population Stability Index (PSI) and Kolmogorov-Smirnov (KS) statistics between recent and training populations to detect data drift and population changes.
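The two drift statistics named above can be sketched in plain Python. This is an illustrative implementation under assumptions: the decile binning, the small floor on empty bins, and the synthetic Gaussian "scores" are choices made here, not details from the original thread.

```python
import bisect
import math
import random

def psi(expected, actual, n_bins=10, floor=1e-6):
    """Population Stability Index with bin edges at quantiles of `expected`."""
    srt = sorted(expected)
    edges = [srt[int(len(srt) * k / n_bins)] for k in range(1, n_bins)]

    def shares(values):
        counts = [0] * n_bins
        for v in values:
            counts[bisect.bisect_right(edges, v)] += 1
        # Floor keeps the logarithm finite when a bin is empty.
        return [max(c / len(values), floor) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for v in a + b:
        cdf_a = bisect.bisect_right(a, v) / len(a)
        cdf_b = bisect.bisect_right(b, v) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Synthetic stand-ins for a model score in the training and recent populations.
random.seed(0)
train = [random.gauss(0.0, 1.0) for _ in range(5000)]
recent_stable = [random.gauss(0.0, 1.0) for _ in range(5000)]
recent_shifted = [random.gauss(0.5, 1.0) for _ in range(5000)]

psi_stable, psi_shifted = psi(train, recent_stable), psi(train, recent_shifted)
ks_stable, ks_shifted = ks_statistic(train, recent_stable), ks_statistic(train, recent_shifted)
print(f"stable:  PSI={psi_stable:.4f}  KS={ks_stable:.4f}")
print(f"shifted: PSI={psi_shifted:.4f}  KS={ks_shifted:.4f}")
```

Commonly cited rules of thumb treat a PSI above roughly 0.1 as worth investigating and above roughly 0.25 as a significant shift, but these thresholds are conventions, so an audit should state which ones it applies and why. The same functions can be run per input feature as well as on the final score.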
In practice
- Test models on the newest data using chronological splits.
- Use the Boruta algorithm or mutual information to check for irrelevant variables.
- Construct examples where models fail due to weak features.
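The chronological-split check in the list above might look like the following sketch. Everything here is hypothetical: the record fields (`application_date`, `score`, `defaulted`), the 20% holdout, and the synthetic data in which the score loses its signal in the latest period. The point is only to show the mechanics of comparing in-time and out-of-time discrimination.

```python
import random
from datetime import date, timedelta

def chronological_split(records, date_key, holdout_fraction=0.2):
    """Sort by date and hold out the most recent fraction of records."""
    ordered = sorted(records, key=lambda r: r[date_key])
    n_holdout = round(len(ordered) * holdout_fraction)
    cut = len(ordered) - n_holdout
    return ordered[:cut], ordered[cut:]

def auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def evaluate(rows):
    return auc([r["score"] for r in rows], [r["defaulted"] for r in rows])

# Synthetic applications: the score carries signal for the first 400 days only.
random.seed(0)
start = date(2023, 1, 1)
records = []
for i in range(500):
    defaulted = i % 2
    signal = 0.5 * defaulted if i < 400 else 0.0
    records.append({
        "application_date": start + timedelta(days=i),
        "score": signal + random.random(),
        "defaulted": defaulted,
    })
random.shuffle(records)  # input order should not matter; the split sorts by date

in_time, out_of_time = chronological_split(records, "application_date", 0.2)
auc_in, auc_out = evaluate(in_time), evaluate(out_of_time)
print(f"in-time AUC={auc_in:.3f}  out-of-time AUC={auc_out:.3f}")
```

A random split of the same records would mix early and late applications and hide the decay, which is why the holdout is taken strictly from the most recent dates.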
Topics
- Model Risk Audit
- Ensemble Models
- XGBoost
- Feature Importance
- Data Drift
Best for: Data Scientist, AI Data Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.