When Linear Regression Beats Random Forest: A UK Salary Prediction Case Study
Summary
A UK HR analytics case study explores salary prediction on a synthetic dataset of 720 rows, comparing a mean-predictor baseline, ordinary linear regression, Ridge regression, and Random Forest. Contrary to expectations, the linear model clearly outperformed Random Forest, with a Mean Absolute Error (MAE) of £4,600 against Random Forest's £6,588, which makes Random Forest's error about 43% higher. The study stresses a robust ML workflow: train/test discipline, preprocessing pipelines, and translating model error into business impact. It also uses residual diagnostics to understand how errors are distributed and permutation importance for interpretability, and it closes with a four-step governance framework for deploying salary models in sensitive contexts.
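The 43% figure depends on which model is taken as the reference. A quick check of the arithmetic, using only the two MAE values quoted in the summary:

```python
# MAE figures quoted in the summary (GBP).
mae_linear = 4600
mae_forest = 6588

gap = mae_forest - mae_linear

# Relative to the linear model: how much higher is Random Forest's error?
forest_vs_linear = gap / mae_linear

# Relative to Random Forest: how much lower is the linear model's error?
linear_vs_forest = gap / mae_forest

print(f"Random Forest MAE is {forest_vs_linear:.1%} higher than the linear model's")
print(f"Linear MAE is {linear_vs_forest:.1%} lower than Random Forest's")
```

Measured against the linear model, Random Forest's error is roughly 43% higher; measured against Random Forest, the linear model's error is roughly 30% lower.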
Key takeaway
For data scientists building predictive models on small, tabular datasets with largely linear signal, try regularized linear regression before defaulting to complex tree ensembles such as Random Forest or XGBoost. Start by establishing a strong baseline and a robust, leakage-safe preprocessing pipeline. For sensitive applications like salary prediction, put comprehensive governance in place: Data Protection Impact Assessments (DPIAs), fairness audits, human-in-the-loop review, and continuous monitoring. These matter more than raw R² for real-world utility and compliance.
Key insights
Model complexity must be earned by data, as simpler models can outperform complex ones on small datasets.
Principles
- Always establish a baseline model.
- Preprocessing must occur within the pipeline, so that scalers and encoders are fitted on training data only.
- Governance is paramount for sensitive models.
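The first two principles can be sketched with scikit-learn. This is a minimal illustration, not the article's code: the column names (`years_experience`, `region`), the coefficients, and the noise level are all assumptions made up for the example, and only the row count of 720 comes from the summary.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical synthetic data: feature names and coefficients are invented.
rng = np.random.default_rng(0)
n = 720
df = pd.DataFrame({
    "years_experience": rng.uniform(0, 25, n),
    "region": rng.choice(["London", "North West", "Scotland"], n),
})
df["salary"] = (
    28_000
    + 1_500 * df["years_experience"]
    + np.where(df["region"] == "London", 8_000, 0)
    + rng.normal(0, 5_000, n)
)

X = df[["years_experience", "region"]]
y = df["salary"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Preprocessing lives inside the pipeline, so the scaler and encoder
# are fitted on the training rows only when model.fit() is called.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["years_experience"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])
model = Pipeline([("preprocess", preprocess), ("regressor", Ridge(alpha=1.0))])

# Mean-predictor baseline: always predicts the training-set mean salary.
baseline = DummyRegressor(strategy="mean")

baseline.fit(X_train, y_train)
model.fit(X_train, y_train)

baseline_mae = mean_absolute_error(y_test, baseline.predict(X_test))
model_mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"Baseline MAE: £{baseline_mae:,.0f}")
print(f"Ridge MAE:    £{model_mae:,.0f}")
```

Any candidate model should beat the baseline MAE on the held-out rows before it is worth discussing further.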
Method
The workflow involves data framing, EDA, train/test splitting, pipeline-based preprocessing, baseline modeling, comparative model evaluation, residual diagnostics, and permutation importance for interpretability.
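The later steps of this workflow (comparative evaluation, residual diagnostics, permutation importance) might look like the sketch below. The data is generated here with a linear signal by construction, so it illustrates the mechanics only and does not reproduce the article's figures; the feature names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data with a linear signal plus noise.
rng = np.random.default_rng(42)
n = 720
X = pd.DataFrame({
    "years_experience": rng.uniform(0, 25, n),
    "performance_score": rng.uniform(1, 5, n),
    "noise_feature": rng.normal(0, 1, n),  # unrelated to salary by construction
})
y = (
    28_000
    + 1_500 * X["years_experience"]
    + 2_000 * X["performance_score"]
    + rng.normal(0, 5_000, n)
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Comparative evaluation on the same held-out rows.
models = {
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=42),
}
test_mae = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_mae[name] = mean_absolute_error(y_test, model.predict(X_test))
    print(f"{name}: MAE = £{test_mae[name]:,.0f}")

# Residual diagnostics: summarise how the errors are distributed.
residuals = y_test - models["ridge"].predict(X_test)
print(residuals.describe())

# Permutation importance: how much does test MAE worsen when a
# feature's values are shuffled?
result = permutation_importance(
    models["ridge"], X_test, y_test,
    scoring="neg_mean_absolute_error", n_repeats=10, random_state=42,
)
importance = pd.Series(result.importances_mean, index=X.columns)
importance = importance.sort_values(ascending=False)
print(importance)
```

Computing permutation importance on the test split rather than the training split keeps the interpretation tied to out-of-sample behaviour.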
In practice
- Use `handle_unknown='ignore'` in `OneHotEncoder` so that categories unseen during training do not raise an error at prediction time.
- Translate MAE into a percentage of mean salary.
- Conduct a fairness audit for sensitive models.
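The first two tips can be illustrated briefly. In this sketch the region names are made up, and the mean salary of £46,000 is a hypothetical value chosen for the example, since the summary does not state it; only the £4,600 MAE comes from the source.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit the encoder on two hypothetical regions seen during training.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(np.array([["London"], ["Scotland"]]))

# "Wales" was never seen in training. With handle_unknown="ignore"
# it is encoded as a row of zeros instead of raising an error.
encoded = encoder.transform(np.array([["London"], ["Wales"]])).toarray()
print(encoded)

# Translate MAE into a percentage of mean salary.
mae = 4_600            # linear model MAE quoted in the summary (GBP)
mean_salary = 46_000   # ASSUMPTION: illustrative value, not from the source
mae_pct = mae / mean_salary * 100
print(f"MAE is {mae_pct:.1f}% of mean salary")
```

Expressing the error as a share of mean salary gives stakeholders a scale they can judge without knowing what MAE means.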
Topics
- Linear Regression
- Random Forest
- Salary Prediction
- Model Complexity
- Data Governance
Best for: Data Scientist, HR Professional, Consultant
Editorial summary, takeaway, and curation by AIssential. Original article published by Data Science on Medium.