Building Robust Credit Scoring Models (Part 3)

2026-03-20 · Source: Towards Data Science · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, long

Summary

This article, the third in a series on building robust credit scoring models, covers two data preprocessing steps: handling outliers and handling missing values. It illustrates both on an open-source Kaggle Credit Scoring Dataset with 32,581 observations and 12 variables. The author first creates an artificial "year" variable from `cb_person_cred_hist_length` so that the data can be split by time. The article then explains why the data should be split into training (70%), test (30%), and out-of-time (OOT; the 2022 data) sets *before* preprocessing, so that the model's ability to generalize is not compromised. Outlier treatment is demonstrated with the IQR method, along with a discussion of when it applies and of alternatives such as Winsorization, particularly for variables like `person_age`. Finally, the article covers missing value imputation for `person_emp_length` (set to 0, a conservative choice for credit scoring) and `loan_int_rate` (imputed with the median), and distinguishes between the Missing At Random (MAR) and Missing Completely At Random (MCAR) mechanisms.

Key takeaway

For data scientists developing credit scoring models, separating the data into training, test, and out-of-time sets *before* any preprocessing is non-negotiable. This ensures that every outlier treatment and missing value imputation is calibrated on the training data alone, which prevents data leakage and preserves the model's ability to generalize to new, unseen borrowers. Apply the same fitted transformations, unchanged, to the test and out-of-time sets so that they remain independent of the calibration and performance evaluations stay unbiased.
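The fit-on-train, apply-everywhere pattern described above can be sketched as follows. This is a minimal illustration rather than the article's own code: the helper names `fit_median` and `impute` are hypothetical, and the three tiny frames are made-up stand-ins for the real splits.

```python
import numpy as np
import pandas as pd


def fit_median(train: pd.DataFrame, column: str) -> float:
    """Learn the imputation value from the training set only."""
    return float(train[column].median())


def impute(df: pd.DataFrame, column: str, value: float) -> pd.DataFrame:
    """Return a copy of df with missing values in `column` replaced by `value`."""
    out = df.copy()
    out[column] = out[column].fillna(value)
    return out


# Toy frames standing in for the real train / test / OOT splits (made-up values).
train = pd.DataFrame({"loan_int_rate": [10.0, 12.0, np.nan, 14.0]})
test = pd.DataFrame({"loan_int_rate": [np.nan, 30.0]})
oot = pd.DataFrame({"loan_int_rate": [np.nan, 9.0]})

# The median is computed once, on the training data, and reused unchanged.
rate_median = fit_median(train, "loan_int_rate")
train_clean, test_clean, oot_clean = (
    impute(part, "loan_int_rate", rate_median) for part in (train, test, oot)
)
```

The point of the design is that the test and OOT frames never contribute to `rate_median`, so their values cannot leak into the calibration.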

Key insights

Proper data splitting and preprocessing are crucial for building generalizable and stable credit scoring models.

Principles

Split data before preprocessing to prevent data leakage.
Calibrate preprocessing only on training data.
Domain expert input is vital for data treatment.

Method

Create a time variable, split data into train/test/OOT, apply IQR-based outlier treatment or Winsorization, and impute missing values based on MAR/MCAR analysis, ensuring all steps are calibrated on training data.
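The splitting step of this method might look like the sketch below. Several things here are assumptions rather than details from the article: the column names `year` and `loan_status`, the function name `split_train_test_oot`, a unique row index, and the use of a per-class pandas sample to stratify the 70/30 split. The demo data is synthetic.

```python
import pandas as pd


def split_train_test_oot(df, target="loan_status", year_col="year",
                         oot_year=2022, train_frac=0.7, seed=42):
    """Hold out one year as OOT, then split the rest into stratified train/test."""
    oot = df[df[year_col] == oot_year]
    in_time = df[df[year_col] != oot_year]
    # Stratify: sample the same fraction of rows inside each target class.
    train = in_time.groupby(target, group_keys=False).sample(
        frac=train_frac, random_state=seed
    )
    # Everything in-time that was not sampled becomes the test set
    # (assumes the index is unique).
    test = in_time.drop(train.index)
    return train, test, oot


# Synthetic demo: 100 in-time rows (20% defaults) and 10 rows from 2022.
demo = pd.DataFrame({
    "year": [2020] * 50 + [2021] * 50 + [2022] * 10,
    "loan_status": ([1] * 10 + [0] * 40) * 2 + [1] * 2 + [0] * 8,
})
train, test, oot = split_train_test_oot(demo)
```

Because the split happens first, any bounds or imputation values computed afterwards can be derived from `train` alone.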

In practice

Use stratified split for train/test sets.
Apply IQR method for outlier capping.
Impute `person_emp_length` with 0 for conservatism.
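As a rough sketch of the last two practices, the snippet below caps `person_age` with IQR bounds learned on the training set and fills missing `person_emp_length` with 0. The 1.5 multiplier is the conventional IQR fence and is an assumption here, as are the helper name `iqr_bounds` and the toy values.

```python
import numpy as np
import pandas as pd


def iqr_bounds(train_col: pd.Series, k: float = 1.5):
    """Return (lower, upper) capping bounds computed from training data only."""
    q1, q3 = train_col.quantile([0.25, 0.75])
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)


# Toy data with one implausible age and some missing employment lengths.
train = pd.DataFrame({
    "person_age": [22.0, 24.0, 26.0, 28.0, 30.0, 32.0, 34.0, 36.0, 144.0],
    "person_emp_length": [1.0, np.nan, 3.0, 4.0, np.nan, 6.0, 7.0, 8.0, 9.0],
})
test = pd.DataFrame({
    "person_age": [21.0, 123.0],
    "person_emp_length": [np.nan, 2.0],
})

# Bounds come from the training set; the same bounds cap every other set.
low, high = iqr_bounds(train["person_age"])
train_capped = train.assign(person_age=train["person_age"].clip(low, high))
test_capped = test.assign(person_age=test["person_age"].clip(low, high))

# Conservative imputation: treat a missing employment length as 0 years.
train_capped["person_emp_length"] = train_capped["person_emp_length"].fillna(0)
test_capped["person_emp_length"] = test_capped["person_emp_length"].fillna(0)
```

Capping keeps every row in the dataset, whereas dropping outliers would shrink it; which is preferable for a given variable is the kind of decision the article leaves to domain judgment.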

Topics

Credit Scoring
Data Preprocessing
Outlier Treatment
Missing Value Imputation
Model Validation

Best for: Data Scientist, Machine Learning Engineer, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.