Building Robust Credit Scoring Models (Part 3)
Summary
This article, the third in a series on building robust credit scoring models, covers two data preprocessing steps: handling outliers and handling missing values. It illustrates them on an open-source Kaggle Credit Scoring Dataset with 32,581 observations and 12 variables. The author first derives an artificial "year" variable from `cb_person_cred_hist_length` so that the data can be split by time. The article then explains why the data should be split into training (70%), test (30%), and out-of-time (OOT, the 2022 data) sets *before* preprocessing: doing so preserves the model's ability to generalize. Outlier treatment is demonstrated with the IQR method, along with a discussion of when it applies and of alternatives such as Winsorization, particularly for variables like `person_age`. Finally, it covers missing value imputation: `person_emp_length` is set to 0, a conservative choice for credit scoring, and `loan_int_rate` is imputed with the median. The article also distinguishes between the Missing At Random (MAR) and Missing Completely At Random (MCAR) mechanisms.
Key takeaway
For data scientists developing credit scoring models, separating the data into training, test, and out-of-time sets *before* any preprocessing is non-negotiable. This ensures that all outlier treatments and missing value imputations are calibrated solely on the training data, which prevents data leakage and preserves the model's ability to generalize to new, unseen borrowers. Apply the same training-derived transformations, unchanged, to the test and out-of-time sets so that they remain independent and performance evaluations stay unbiased.
Key insights
Proper data splitting and preprocessing are crucial for building generalizable and stable credit scoring models.
Principles
- Split data before preprocessing to prevent data leakage.
- Calibrate preprocessing only on training data.
- Domain expert input is vital for data treatment.
Method
Create a time variable, split data into train/test/OOT, apply IQR-based outlier treatment or Winsorization, and impute missing values based on MAR/MCAR analysis, ensuring all steps are calibrated on training data.
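The splitting step of this method can be sketched as follows. This is a minimal illustration, not the original article's code: it assumes the artificial `year` column has already been created, that the target column is named `loan_status`, and that every year other than 2022 counts as in-time data. The function name `split_train_test_oot` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def split_train_test_oot(df, target="loan_status", year_col="year",
                         oot_year=2022, train_frac=0.7, seed=42):
    """Hold out one year as OOT, then split the rest 70/30, stratified on the target.

    Assumes df has a unique index so that test rows can be found by dropping
    the sampled training rows.
    """
    oot = df[df[year_col] == oot_year]
    in_time = df[df[year_col] != oot_year]
    # Sample the same fraction within each target class (stratified split).
    train = in_time.groupby(target, group_keys=False).sample(
        frac=train_frac, random_state=seed
    )
    test = in_time.drop(train.index)
    return train, test, oot


# Toy data standing in for the real dataset (values are synthetic).
rng = np.random.default_rng(0)
n = 1000
demo = pd.DataFrame({
    "year": rng.choice([2020, 2021, 2022], size=n),
    "loan_status": rng.choice([0, 1], size=n, p=[0.8, 0.2]),
    "person_income": rng.normal(50000, 15000, size=n),
})

train, test, oot = split_train_test_oot(demo)
print(len(train), len(test), len(oot))
print(train["loan_status"].mean(), test["loan_status"].mean())
```

Because the split happens before any statistics are computed, every later preprocessing parameter can be estimated from `train` alone.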
In practice
- Use stratified split for train/test sets.
- Apply IQR method for outlier capping.
- Impute `person_emp_length` with 0 for conservatism.
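The capping and imputation practices above can be sketched as a fit/apply pair, so that all parameters come from the training set and are then reused unchanged elsewhere. This is an illustrative sketch under assumptions: the helper names `fit_preprocessing` and `apply_preprocessing`, the 1.5 IQR multiplier, and the toy values are not taken from the original article.

```python
import numpy as np
import pandas as pd


def fit_preprocessing(train, cap_cols, k=1.5):
    """Estimate IQR capping bounds and the imputation median on training data only."""
    params = {
        "caps": {},
        "loan_int_rate_median": train["loan_int_rate"].median(),
    }
    for col in cap_cols:
        q1, q3 = train[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        params["caps"][col] = (q1 - k * iqr, q3 + k * iqr)
    return params


def apply_preprocessing(df, params):
    """Apply the training-derived parameters to any dataset (train, test, or OOT)."""
    out = df.copy()
    for col, (lower, upper) in params["caps"].items():
        # Cap values outside the training IQR fences; NaN values are left as NaN.
        out[col] = out[col].clip(lower=lower, upper=upper)
    # Conservative choice: treat unknown employment length as 0 years.
    out["person_emp_length"] = out["person_emp_length"].fillna(0)
    # Median imputation using the training median, never the target set's own median.
    out["loan_int_rate"] = out["loan_int_rate"].fillna(params["loan_int_rate_median"])
    return out


# Toy data (synthetic values, for illustration only).
train_df = pd.DataFrame({
    "person_age": [22.0, 25.0, 30.0, 35.0, 40.0, 144.0],
    "person_emp_length": [1.0, np.nan, 5.0, 3.0, 10.0, 2.0],
    "loan_int_rate": [10.0, 12.0, np.nan, 8.0, 14.0, 11.0],
})
test_df = pd.DataFrame({
    "person_age": [28.0, 123.0],
    "person_emp_length": [np.nan, 4.0],
    "loan_int_rate": [np.nan, 9.5],
})

params = fit_preprocessing(train_df, cap_cols=["person_age"])
train_clean = apply_preprocessing(train_df, params)
test_clean = apply_preprocessing(test_df, params)
print(params)
print(test_clean)
```

Keeping the fitted parameters in a separate object makes it straightforward to apply exactly the same treatment to the out-of-time set later.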
Topics
- Credit Scoring
- Data Preprocessing
- Outlier Treatment
- Missing Value Imputation
- Model Validation
Best for: Data Scientist, Machine Learning Engineer, AI Student
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.