I Improved My Model… But My Score Got Worse
Summary
An analysis of the Ames Housing Competition investigates why model improvements often fail to generalize even when cross-validation (CV) scores improve. The study established a cross-validation strategy before any model building, then used tree-based models (CatBoost, LightGBM, HistGradientBoosting, and XGBoost) as baselines. Initial feature engineering, including ordinal encoding, binary presence flags, and interaction features such as "HouseAge = YrSold - YearBuilt", consistently improved model performance. More advanced techniques, such as log transformation of skewed numerical variables, category consolidation, and frequency encoding, yielded mixed results, showing that added feature complexity does not guarantee better generalization. A logarithmic transformation of the target sale price improved model stability and the leaderboard scores of XGBoost and HistGradientBoosting, but not those of LightGBM or CatBoost. The findings emphasize that CV results alone are insufficient to predict generalization: some simpler feature sets outperformed more complex configurations on unseen data.
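The simpler feature steps named in the summary can be sketched in a few lines of pandas. This is an illustrative sketch, not the original article's code: "YrSold" and "YearBuilt" come from the summary, while "GarageType", "HasGarage", "SalePrice", and every row value are assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the housing data; all values are invented.
df = pd.DataFrame({
    "YrSold": [2008, 2010, 2006],
    "YearBuilt": [2003, 1976, 2006],
    "GarageType": ["Attchd", np.nan, "Detchd"],   # assumed column name
    "SalePrice": [208500.0, 181500.0, 140000.0],  # assumed target name
})


def add_basic_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Add the derived age feature and a binary presence flag."""
    out = frame.copy()
    # Feature quoted in the summary: HouseAge = YrSold - YearBuilt.
    out["HouseAge"] = out["YrSold"] - out["YearBuilt"]
    # Presence flag: 1 if a garage is recorded, 0 if the value is missing.
    out["HasGarage"] = out["GarageType"].notna().astype(int)
    return out


features = add_basic_features(df)

# Log-transform the target for training, and invert it for predictions.
log_price = np.log1p(df["SalePrice"])
recovered_price = np.expm1(log_price)
```

Using log1p and expm1 as a pair keeps the transform exactly invertible, so predictions made on the log scale can be mapped back to prices before scoring or submission.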
Key takeaway
If you build regression models on tabular data, set up a robust cross-validation strategy before doing extensive feature engineering. Focus on domain-driven feature transformations, like ordinal encoding or interaction features, as these often generalize better than complex statistical methods. Be wary of optimizing solely for cross-validation scores, as they may not reflect true generalization; always evaluate against unseen data to catch overfitting introduced during feature engineering.
Key insights
Cross-validation improvements don't guarantee generalization; simpler, domain-driven features often outperform complex ones.
Principles
- Validation strategy must precede model building.
- Domain-driven features often generalize best.
- Feature complexity doesn't ensure better generalization.
Method
The process involved establishing cross-validation, building baseline tree models, applying initial domain-driven feature engineering, then advanced statistical feature engineering, and finally target variable transformation.
In practice
- Use "ColumnTransformer" for consistent encoding.
- Encode missing values explicitly, for example as binary presence flags, to capture absence signals.
- Apply ordinal encoding for naturally ranked categories.
Topics
- Feature Engineering
- Cross-Validation
- Model Generalization
- Gradient Boosting
- Tabular Data
- Target Transformation
Best for: Machine Learning Engineer, Data Scientist, AI Student
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning on Medium.