How to Select Variables Robustly in a Scoring Model
Summary
This article details a robust variable selection methodology for credit scoring models that favors stability over raw performance on the training data. It introduces a filter method that uses stratified 4-fold cross-validation so that the selected variables remain stable, interpretable, and robust across different data subsets. The process applies four sequential rules: drop continuous variables not significantly linked to default (Kruskal-Wallis p-value > 5%), drop categorical variables weakly linked to default (Cramér's V < 10%), drop redundant continuous variables (Spearman correlation ≥ 60%), and drop redundant categorical variables (Cramér's V ≥ 50%). Applied to the Kaggle Credit Scoring Dataset of 32,581 loans, the method selected 7 variables (5 continuous, 2 categorical) that passed all stability criteria.
Key takeaway
For data scientists building credit scoring or similar risk models, variable stability across data subsets is critical. Models are more reliable in production when variable selection uses a cross-validation-based filter method in which each chosen feature keeps its significance and non-redundancy in every fold. This approach reduces the risk of models breaking on new data and improves auditability.
Key insights
Robust variable selection prioritizes stability across data subsets over performance on a single training set.
Principles
- Variables must be stable across all data folds.
- Redundant variables degrade model performance.
- Auditable selection enhances model trustworthiness.
Method
Employ stratified k-fold cross-validation to evaluate variable significance and redundancy across multiple data subsets, dropping variables that fail criteria in any single fold.
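The fold-wise check can be sketched in Python for the continuous-variable rule. This is a minimal illustration under stated assumptions, not the original article's code: the function name `stable_continuous_vars` and its parameters are hypothetical, and treating each of the 4 stratified folds as one data subset is an assumption (the article may evaluate on a different split of each fold). The 5% Kruskal-Wallis threshold comes from the summary above.

```python
import numpy as np
import pandas as pd
from scipy.stats import kruskal
from sklearn.model_selection import StratifiedKFold


def stable_continuous_vars(df, target, candidates, n_splits=4, alpha=0.05, seed=0):
    """Keep candidates whose Kruskal-Wallis p-value is <= alpha in every fold.

    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    kept = set(candidates)
    # Assumption: each held-out fold is treated as one data subset.
    for _, fold_idx in skf.split(df, df[target]):
        fold = df.iloc[fold_idx]
        for var in list(kept):
            # One sample of the variable per target class (default / no default).
            groups = [g[var].dropna().to_numpy() for _, g in fold.groupby(target)]
            _, p_value = kruskal(*groups)
            # Failing the criterion in a single fold is enough to drop the variable.
            if p_value > alpha:
                kept.discard(var)
    # Preserve the original candidate order.
    return [v for v in candidates if v in kept]
```

Dropping on the first failing fold makes the rule strict by design: a variable that is only significant in some subsets is treated as unstable.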
In practice
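- Use the Kruskal-Wallis test to check whether a continuous variable is linked to the target.
- Apply Cramér's V to measure association for categorical variables, both with the target and with each other.
- Check Spearman correlation to detect redundancy between continuous variables.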
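The two association measures in this list that SciPy does not expose as a single threshold check can be sketched as follows. This is a hedged illustration: `cramers_v` and `redundant_pairs` are hypothetical helper names, the Cramér's V shown is the plain (not bias-corrected) version, and comparing the absolute value of Spearman's rho to the threshold is an assumption. The 60% threshold comes from the summary above; the source does not say which variable of a redundant pair to drop, so the sketch only reports the pairs.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, spearmanr


def cramers_v(x, y):
    """Cramér's V between two categorical series (no bias correction)."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    # A variable with a single category carries no association information.
    return float(np.sqrt(chi2 / (n * min_dim))) if min_dim > 0 else 0.0


def redundant_pairs(df, columns, threshold=0.60):
    """List pairs of continuous columns with |Spearman rho| >= threshold."""
    pairs = []
    for i, a in enumerate(columns):
        for b in columns[i + 1:]:
            rho, _ = spearmanr(df[a], df[b], nan_policy="omit")
            # Assumption: redundancy is judged on the absolute correlation.
            if abs(rho) >= threshold:
                pairs.append((a, b, float(rho)))
    return pairs
```

With these helpers, a categorical variable would be compared against the target using the 10% cutoff and against other categorical variables using the 50% cutoff, ideally inside the same fold loop so that each decision holds in every subset.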
Topics
- Variable Selection
- Scoring Models
- Robustness
- Cross-Validation
- Filter Method
Best for: Machine Learning Engineer, Data Scientist, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.