How to Select Variables Robustly in a Scoring Model

Source: Towards Data Science · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics) · Depth: Intermediate, medium

Summary

This article details a variable selection methodology for credit scoring models that favors stability over raw performance on the training data. It introduces a filter method built on stratified 4-fold cross-validation, so that a variable is retained only if it passes the selection criteria in every fold. Four rules are applied in sequence: (1) drop continuous variables not significantly linked to default (Kruskal-Wallis p-value > 5%); (2) drop categorical variables weakly linked to default (Cramér's V < 10%); (3) drop redundant continuous variables (Spearman correlation ≥ 60%); and (4) drop redundant categorical variables (Cramér's V ≥ 50%). Applied to the Kaggle Credit Scoring Dataset of 32,581 loans, the method selected 7 variables (5 continuous, 2 categorical) that passed all stability criteria.
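The three statistics behind the four rules can be sketched in a few lines of Python. This is a minimal illustration, not the article's code: the helper names (`kruskal_pvalue`, `cramers_v`, `spearman_abs`) and the constant names are hypothetical, the thresholds are the ones quoted in the summary, and taking the absolute value of the Spearman correlation is an assumption.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, kruskal

# Thresholds quoted in the summary (constant names are illustrative).
KW_ALPHA = 0.05       # rule 1: drop a continuous variable if p-value > 5%
CRAMER_MIN = 0.10     # rule 2: drop a categorical variable if V < 10%
SPEARMAN_MAX = 0.60   # rule 3: continuous pair is redundant if rho >= 60%
CRAMER_MAX = 0.50     # rule 4: categorical pair is redundant if V >= 50%


def kruskal_pvalue(x: pd.Series, y: pd.Series) -> float:
    """Kruskal-Wallis p-value for x compared across the classes of y."""
    mask = x.notna()
    groups = [x[mask & (y == cls)] for cls in sorted(y.unique())]
    return float(kruskal(*groups).pvalue)


def cramers_v(a: pd.Series, b: pd.Series) -> float:
    """Cramér's V between two categorical series (no bias correction)."""
    table = pd.crosstab(a, b)
    k = min(table.shape) - 1
    if k <= 0:  # a constant variable carries no association
        return 0.0
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    n = table.to_numpy().sum()
    return float(np.sqrt(chi2 / (n * k)))


def spearman_abs(x: pd.Series, z: pd.Series) -> float:
    """Absolute Spearman rank correlation (absolute value is an assumption)."""
    return float(abs(x.corr(z, method="spearman")))
```

With these helpers, rule 1 reads `kruskal_pvalue(x, default) > KW_ALPHA`, rule 2 reads `cramers_v(cat, default) < CRAMER_MIN`, and rules 3 and 4 compare `spearman_abs` and `cramers_v` between pairs of candidate variables against `SPEARMAN_MAX` and `CRAMER_MAX`.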

Key takeaway

For data scientists building credit scoring or similar risk models, variable stability across data subsets is critical. Models are more reliable in production when variable selection uses a cross-validation-based filter that requires each chosen feature to remain significant and non-redundant in every fold. This approach reduces the risk of a model breaking on new data and makes the selection easier to audit.

Key insights

Robust variable selection prioritizes stability across data subsets over performance on a single training set.

Principles

Method

Employ stratified k-fold cross-validation to evaluate variable significance and redundancy across multiple data subsets, dropping any variable that fails a criterion in at least one fold.
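The fold-level logic can be sketched as follows for continuous variables. This is a hedged illustration under stated assumptions: the function names, the `seed` parameter, and the choice to compute each statistic on the held-out part of every stratified split (four disjoint subsets) are not taken from the article, and the redundancy helper only flags pairs without deciding which member of a pair to drop.

```python
import pandas as pd
from scipy.stats import kruskal
from sklearn.model_selection import StratifiedKFold


def stable_continuous_variables(df, target, candidates,
                                n_splits=4, alpha=0.05, seed=42):
    """Keep variables whose Kruskal-Wallis p-value is <= alpha in every fold."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    kept = set(candidates)
    for _, fold_idx in skf.split(df, df[target]):
        fold = df.iloc[fold_idx]  # assumption: test on each disjoint subset
        for var in list(kept):
            groups = [g[var].dropna() for _, g in fold.groupby(target)]
            if kruskal(*groups).pvalue > alpha:
                kept.discard(var)  # one failing fold is enough to drop
    return [v for v in candidates if v in kept]


def redundant_pairs(df, target, variables,
                    n_splits=4, rho_max=0.60, seed=42):
    """Flag pairs whose absolute Spearman correlation reaches rho_max in any fold."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    flagged = set()
    for _, fold_idx in skf.split(df, df[target]):
        corr = df.iloc[fold_idx][variables].corr(method="spearman").abs()
        for i, a in enumerate(variables):
            for b in variables[i + 1:]:
                if corr.loc[a, b] >= rho_max:
                    flagged.add((a, b))
    return sorted(flagged)
```

Stratifying on the target keeps the default rate roughly equal across folds, so each Kruskal-Wallis test sees both classes. The same loop structure would apply to categorical variables by swapping in Cramér's V and the corresponding thresholds.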

Best for: Machine Learning Engineer, Data Scientist, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.