Kaggle Solution Walkthroughs: Home Credit - Credit Risk Model Stability with Team DebtCollectors

· Source: Kaggle · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, long

Summary

A data science team built a custom loss function and a full preprocessing pipeline for a competition that scores model stability. The custom loss targets ROC AUC and its stability across weeks directly. Because ROC AUC is not differentiable, gradient boosting libraries such as LightGBM cannot use it as a training objective out of the box. The team's approach counts inversions within subsets of the training data and uses analytically derived first and second derivatives, a labor-intensive process. Preprocessing removed duplicate columns, capped each categorical feature at 20 categories with the remainder grouped together, aggregated columns to consolidate redundant information, and produced compound features such as sums and differences. Feature selection, based on common sense and the weights of preliminary models, narrowed the set to 262 features, with aggregated birth date emerging as a highly important feature. The team also noted that some individual compound features reached ROC AUC scores above 0.7 on their own, which highlights their predictive power.
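To make the inversion idea concrete, here is a minimal sketch of how ROC AUC can be computed by counting inversions, overall and per week. It is an illustration under stated assumptions, not the team's code: the function names and the `weeks` argument are hypothetical, ties are counted as half an inversion, and the quadratic pair loop is chosen for readability over speed.

```python
from collections import defaultdict


def auc_from_inversions(scores, labels):
    """ROC AUC computed as 1 - inversions / (n_pos * n_neg).

    An inversion is a (positive, negative) pair in which the negative
    outscores the positive. A tie counts as half an inversion.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")

    inversions = 0.0
    for p in pos:
        for n in neg:
            if n > p:
                inversions += 1.0
            elif n == p:
                inversions += 0.5
    return 1.0 - inversions / (len(pos) * len(neg))


def weekly_auc(scores, labels, weeks):
    """Group rows by week (hypothetical column) and score each subset."""
    groups = defaultdict(lambda: ([], []))
    for s, y, w in zip(scores, labels, weeks):
        groups[w][0].append(s)
        groups[w][1].append(y)
    return {w: auc_from_inversions(sc, lb) for w, (sc, lb) in sorted(groups.items())}
```

The inversion count is a step function of the scores, so it has no useful gradient. That is why a loss built on it needs a smooth reformulation before derivatives can be taken.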

Key takeaway

For machine learning engineers building robust models for competitions or production: consider a custom loss function when the target metric, such as ROC AUC, is non-differentiable and cannot be optimized directly. Prioritize feature engineering, including aggregation and compound feature creation. In this solution those steps supported both stability and predictive performance, and some single compound features reached ROC AUC above 0.7 on their own.

Key insights

Optimizing a non-differentiable metric like ROC AUC directly requires a custom loss function, and careful feature engineering complements it.
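One common way to obtain the first and second derivatives that a gradient boosting library needs is to replace each hard inversion with a smooth pairwise penalty. The sketch below uses a pairwise logistic surrogate, which is a swapped-in technique and not necessarily the team's derivation. The function name is hypothetical, and only the diagonal of the Hessian is kept, since boosting libraries such as LightGBM expect one gradient and one Hessian value per row from a custom objective.

```python
import math


def _sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def pairwise_auc_grad_hess(scores, labels):
    """Gradient and diagonal Hessian of a smooth AUC surrogate.

    Surrogate loss (an assumption, not the team's formula):
        L = mean over (pos i, neg j) of log(1 + exp(-(s_i - s_j)))
    """
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    grad = [0.0] * len(scores)
    hess = [0.0] * len(scores)
    if not pos or not neg:
        return grad, hess

    norm = 1.0 / (len(pos) * len(neg))
    for i in pos:
        for j in neg:
            # w is large when the pair is inverted or nearly inverted.
            w = _sigmoid(-(scores[i] - scores[j]))
            grad[i] -= norm * w          # push the positive's score up
            grad[j] += norm * w          # push the negative's score down
            h = norm * w * (1.0 - w)     # second derivative of the pair term
            hess[i] += h
            hess[j] += h
    return grad, hess
```

To reward stability across weeks, the same computation could be run separately within each week's rows and the results combined, at the cost of more bookkeeping.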

Method

The method pairs a custom loss function, which counts inversions within data subsets to optimize ROC AUC and its stability, with a multi-stage feature engineering pipeline that includes aggregation and compound feature creation.
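Two of the feature engineering steps can be sketched in a few lines. This is a rough illustration only: the helper names, the `"OTHER"` label, and the column names in the usage note are hypothetical, and the sketch assumes the cap means keeping the 20 most frequent categories and mapping everything else to one shared bucket.

```python
from collections import Counter


def cap_categories(values, max_categories=20, other="OTHER"):
    """Keep the most frequent categories and group the rest together.

    Missing values (None) are left untouched.
    """
    counts = Counter(v for v in values if v is not None)
    keep = {v for v, _ in counts.most_common(max_categories)}
    return [v if (v is None or v in keep) else other for v in values]


def add_compound_features(row, col_a, col_b):
    """Return a copy of `row` with the sum and difference of two columns.

    If either input is missing, both compound features are None.
    """
    a, b = row.get(col_a), row.get(col_b)
    out = dict(row)
    missing = a is None or b is None
    out[f"{col_a}_plus_{col_b}"] = None if missing else a + b
    out[f"{col_a}_minus_{col_b}"] = None if missing else a - b
    return out
```

For example, `add_compound_features({"credit": 5, "annuity": 3}, "credit", "annuity")` adds `credit_plus_annuity` and `credit_minus_annuity` to the row. Each candidate feature can then be scored on its own with ROC AUC to decide whether it is worth keeping.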

Best for: AI Scientist, Machine Learning Engineer, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Kaggle.