When Linear Regression Beats Random Forest: A UK Salary Prediction Case Study

2026-05-18 · Source: Data Science on Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Human Resources & Workforce Development · Depth: Intermediate, long

Summary

A UK HR analytics case study explored salary prediction on a synthetic dataset of 720 rows, comparing a mean-predictor baseline, ordinary linear regression, Ridge regression, and Random Forest. Contrary to expectations, the linear model clearly outperformed Random Forest: it reached a Mean Absolute Error (MAE) of £4,600 against Random Forest's £6,588, so the Random Forest error was about 43% higher. The study emphasizes a robust ML workflow, including train/test discipline, preprocessing pipelines, and translating model error into business impact. It also uses residual diagnostics to understand how errors are distributed and permutation importance to interpret the model, and it concludes with a four-step governance framework for deploying salary models in sensitive contexts.

Key takeaway

Data scientists building predictive models on small, tabular datasets with largely linear signals should try regularized linear regression before defaulting to complex tree ensembles such as Random Forest or XGBoost. The initial focus should be a strong baseline and a robust, leakage-safe preprocessing pipeline. For sensitive applications like salary prediction, comprehensive governance is critical: Data Protection Impact Assessments (DPIAs), fairness audits, human-in-the-loop review, and continuous monitoring. These matter more than raw R² for real-world utility and compliance.

Key insights

Model complexity must be earned by data, as simpler models can outperform complex ones on small datasets.

Principles

Always establish a baseline model.
Preprocessing must occur within the pipeline, so that transformers are fit on training data only and nothing leaks from the test set.
Governance is paramount for sensitive models.
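
The first two principles can be sketched roughly as follows. This is a minimal illustration with scikit-learn, not the article's code: the column names (`years_experience`, `region`), the salary formula, and the random seed are all hypothetical assumptions made so the example runs on its own. The mean-predictor baseline sets the bar, and the scaler and encoder live inside the pipeline so they are fit on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical synthetic data: columns and coefficients are assumptions,
# not the dataset used in the article.
rng = np.random.default_rng(0)
n = 720
X = pd.DataFrame({
    "years_experience": rng.uniform(0, 30, n),
    "region": rng.choice(["London", "North West", "Scotland"], n),
})
y = (
    28_000
    + 1_500 * X["years_experience"]
    + np.where(X["region"] == "London", 8_000, 0)
    + rng.normal(0, 5_000, n)
)

# Split before any fitting so the test rows never influence preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Preprocessing is part of the pipeline, so it is fit on X_train only.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["years_experience"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])
ridge = Pipeline([("prep", preprocess), ("model", Ridge(alpha=1.0))])
ridge.fit(X_train, y_train)

# Mean-predictor baseline: always predicts the training-set mean salary.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)

baseline_mae = mean_absolute_error(y_test, baseline.predict(X_test))
ridge_mae = mean_absolute_error(y_test, ridge.predict(X_test))
print(f"baseline MAE: £{baseline_mae:,.0f}")
print(f"ridge MAE:    £{ridge_mae:,.0f}")
```

Any candidate model that cannot beat `baseline_mae` on held-out data has not learned anything useful from the features.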

Method

The workflow involves data framing, EDA, train/test splitting, pipeline-based preprocessing, baseline modeling, comparative model evaluation, residual diagnostics, and permutation importance for interpretability.
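
The later steps of this workflow (comparative evaluation, residual diagnostics, permutation importance) might look something like the sketch below. It again relies on hypothetical synthetic data with assumed column names, and the helper `build_pipeline` is introduced only for illustration. Which model wins depends entirely on the data, so no particular ranking should be read into it.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical synthetic data (assumed columns and coefficients).
rng = np.random.default_rng(42)
n = 720
X = pd.DataFrame({
    "years_experience": rng.uniform(0, 30, n),
    "region": rng.choice(["London", "North West", "Scotland"], n),
})
y = (
    28_000
    + 1_500 * X["years_experience"]
    + np.where(X["region"] == "London", 8_000, 0)
    + rng.normal(0, 5_000, n)
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


def build_pipeline(model):
    """Hypothetical helper: one-hot encode `region`, pass numerics through."""
    prep = ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), ["region"])],
        remainder="passthrough",
    )
    return Pipeline([("prep", prep), ("model", model)])


# Comparative evaluation: same split, same preprocessing, different models.
candidates = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=42),
}
fitted, maes = {}, {}
for name, model in candidates.items():
    fitted[name] = build_pipeline(model).fit(X_train, y_train)
    maes[name] = mean_absolute_error(y_test, fitted[name].predict(X_test))
    print(f"{name}: MAE £{maes[name]:,.0f}")

# Residual diagnostics: check whether errors are centred and whether any
# group is systematically over- or under-predicted.
residuals = y_test - fitted["linear"].predict(X_test)
print(residuals.describe())
print(residuals.groupby(X_test["region"]).mean())

# Permutation importance: increase in test MAE when one column is shuffled.
perm = permutation_importance(
    fitted["linear"], X_test, y_test,
    scoring="neg_mean_absolute_error", n_repeats=10, random_state=42,
)
for col, imp in zip(X_test.columns, perm.importances_mean):
    print(f"{col}: +£{imp:,.0f} MAE when shuffled")
```

Permutation importance is computed on the held-out split here, so it reflects what the model relies on for unseen rows rather than what it memorised during training.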

In practice

Use `handle_unknown='ignore'` in OneHotEncoder so that categories unseen during training do not raise an error at prediction time.
Translate MAE into a percentage of mean salary.
Conduct a fairness audit for sensitive models.
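
The first two tips above can be illustrated with a short sketch. The region names are made up, and the mean salary of £40,000 is a purely hypothetical figure (the summary does not state the dataset's mean); only the £4,600 MAE comes from the case study. The function `mae_as_pct_of_mean` is likewise an illustrative helper, not part of any library.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"region": ["London", "Scotland"]})
unseen = pd.DataFrame({"region": ["Wales"]})  # category absent from training

# With handle_unknown="ignore", an unseen category becomes an all-zero row.
tolerant = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded = tolerant.transform(unseen).toarray()
print(encoded)

# With handle_unknown="error", the same input raises a ValueError.
strict = OneHotEncoder(handle_unknown="error").fit(train)
try:
    strict.transform(unseen)
    strict_raised = False
except ValueError:
    strict_raised = True
print("strict encoder raised:", strict_raised)


def mae_as_pct_of_mean(mae: float, mean_salary: float) -> float:
    """Express an MAE as a percentage of the mean salary (illustrative helper)."""
    if mean_salary <= 0:
        raise ValueError("mean_salary must be positive")
    return 100.0 * mae / mean_salary


# £4,600 is the linear model's MAE from the case study; £40,000 is an
# assumed mean salary used only to show the arithmetic.
pct = mae_as_pct_of_mean(4_600, 40_000)
print(f"MAE is {pct:.1f}% of the assumed mean salary")
```

Framing the error as a share of a typical salary gives non-technical stakeholders a quicker sense of whether the model is precise enough for the decision at hand.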

Topics

Linear Regression
Random Forest
Salary Prediction
Model Complexity
Data Governance

Code references

jumma786/uk-salary-regression

Best for: Data Scientist, HR Professional, Consultant

Editorial summary, takeaway, and curation by AIssential. Original article published by Data Science on Medium.