Data Leakage in Machine Learning: Why You Must Split Before Preprocessing
Summary
Data leakage occurs when information from the test set inadvertently influences training. It inflates evaluation scores and leads to poor real-world model performance. The problem often arises when preprocessing steps such as scaling, imputation, dimensionality reduction (PCA), feature engineering, or resampling (SMOTE) are applied to the entire dataset before it is split into training and test sets. The article emphasizes that any step that calculates statistics or extracts patterns from data constitutes "learning" and must be fitted only on the training set. It illustrates the "Wall Strategy" for a correct workflow: split the data first, fit the scaler only on the training data, then apply the fitted scaler to the test data. The article also covers leakage in cross-validation, time-series data, and grouped data, recommending `sklearn.pipeline.Pipeline`, `TimeSeriesSplit`, and `GroupKFold` respectively.
Key takeaway
For Machine Learning Engineers and Data Scientists building production-ready systems, a trustworthy evaluation depends on keeping the test set isolated. Always split your dataset into training and test sets *before* performing any preprocessing step that learns from the data, such as scaling or imputation. Failing to do so introduces data leakage, which produces deceptively high evaluation metrics for models that then perform poorly in real-world deployment. Use `sklearn.pipeline.Pipeline` so that preprocessing is refit inside each cross-validation fold, and consider `TimeSeriesSplit` or `GroupKFold` for time-series or grouped datasets so that leakage does not silently invalidate your evaluation.
Key insights
Preprocessing steps that learn from data must occur after splitting to prevent data leakage and ensure valid model evaluation.
Principles
- The test set must remain unseen.
- Any step computing statistics is learning from data.
- A high accuracy score is meaningless if the evaluation workflow leaks test data.
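To make the second principle concrete, here is a minimal sketch showing how a single statistic changes once test rows are allowed into the computation. The numbers are hypothetical toy values chosen for illustration, not data from the original article.

```python
import numpy as np

# Hypothetical toy feature: six training values and two held-out test values.
train = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
test = np.array([50.0, 60.0])

# Leaky: the mean is computed over train and test together,
# so the test rows shape the parameter used to transform the training data.
leaky_mean = np.concatenate([train, test]).mean()

# Clean: the mean is learned from the training rows only.
clean_mean = train.mean()

print(f"mean with test rows included: {leaky_mean}")
print(f"mean from training rows only: {clean_mean}")
```

The two means differ only because the test rows were visible when the statistic was computed, which is exactly the sense in which computing a statistic counts as learning.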
Method
The "Wall Strategy" involves splitting data first, fitting preprocessing steps exclusively on the training set, and then transforming the test set using the training set's learned parameters.
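The steps of the "Wall Strategy" can be sketched roughly as follows. This is an illustrative example rather than the article's own code: the synthetic data, the 80/20 split, and the choice of `StandardScaler` are assumptions made for the sketch.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data: 100 rows, 3 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y = rng.integers(0, 2, size=100)

# 1. Split first, before any preprocessing touches the data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Fit the scaler on the training set only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# 3. Transform the test set with the parameters learned from training data.
#    Note: transform(), not fit_transform(), so nothing is learned from X_test.
X_test_scaled = scaler.transform(X_test)
```

After this, `scaler.mean_` reflects the training rows alone, and the scaled test set will generally not have exactly zero mean, which is expected.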
In practice
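- Use `sklearn.pipeline.Pipeline` for cross-validation, so preprocessing is refit on the training portion of each fold.
- Use `TimeSeriesSplit` for time-series data, so each training fold precedes its test fold.
- Use `GroupKFold` for grouped entity data, so rows from the same entity never appear in both training and test folds.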
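The three recommendations above might look like this in code. This is a hedged sketch under stated assumptions: the synthetic data, the `groups` array (12 hypothetical entities with 10 rows each), the fold counts, and the use of `LogisticRegression` are illustrative choices, not taken from the original article.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, TimeSeriesSplit, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data: 120 rows, 4 features, binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=120) > 0).astype(int)
groups = np.repeat(np.arange(12), 10)  # assumed entity ID for each row

# Wrapping the scaler and model in a Pipeline means the scaler is refit
# on the training portion of every fold, never on the held-out portion.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

# Standard cross-validation with the pipeline.
cv_scores = cross_val_score(pipe, X, y, cv=5)

# Time-series data (rows assumed to be in time order):
# every training fold ends before its test fold begins.
ts_cv = TimeSeriesSplit(n_splits=4)
ts_scores = cross_val_score(pipe, X, y, cv=ts_cv)

# Grouped data: all rows of an entity stay on one side of each split.
group_cv = GroupKFold(n_splits=4)
group_scores = cross_val_score(pipe, X, y, cv=group_cv, groups=groups)

print(cv_scores.mean(), ts_scores.mean(), group_scores.mean())
```

Passing the whole pipeline to `cross_val_score`, rather than pre-scaled data, is what keeps each fold's held-out rows out of the scaler's fit.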
Topics
- Data Leakage
- Data Preprocessing
- Train-Test Split
- Cross-Validation
- Scikit-learn Pipelines
Best for: Machine Learning Engineer, Data Scientist, AI Student
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards AI - Medium.