Beyond the Straight Line: Choosing Between OLS, Interaction Terms, and Tweedie Regression

2026-06-25 · Source: Towards Data Science · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, long

Summary

This analysis compares three regression approaches for predicting insurance claim amounts on the French Motor Third-Party Liability Claims dataset: Ordinary Least Squares (OLS), OLS with interaction terms, and Tweedie regression. The initial OLS model, a bedrock statistical method, yielded a Mean Absolute Error (MAE) of 174.17; it struggled with the zero-inflated target and non-normal residuals. Adding interaction terms improved it only marginally, to an MAE of 172.24. Tweedie regression, designed for non-negative, zero-inflated, skewed data, cut the MAE to 111.97 with tuned parameters (power=1.76, alpha=1.0). The most effective solution presented is a two-step zero-inflated model that combines a LightGBM classifier for claim occurrence with a Tweedie regressor for claim severity, achieving the lowest MAE of 87.79.
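As a rough illustration of the OLS versus Tweedie contrast, the sketch below fits both models to synthetic data. It assumes scikit-learn's TweedieRegressor, whose power and alpha parameter names match the values quoted above; the summary does not confirm which library the article used. The data, coefficients, and in-sample evaluation are all made up for illustration and will not reproduce the reported MAE figures.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, TweedieRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))  # hypothetical standardized features

# Synthetic compound Poisson-gamma target (an assumption, not the real data):
# most rows have zero claims, the rest a right-skewed positive total.
claim_rate = np.exp(-2.0 + 0.5 * X[:, 0])
n_claims = rng.poisson(claim_rate)
y = np.array([rng.gamma(shape=2.0, scale=500.0, size=k).sum() for k in n_claims])

ols = LinearRegression().fit(X, y)
# power between 1 and 2 selects the compound Poisson-gamma family;
# alpha is the L2 penalty strength. Values taken from the summary.
tweedie = TweedieRegressor(power=1.76, alpha=1.0, max_iter=1000).fit(X, y)

ols_pred = ols.predict(X)
tw_pred = tweedie.predict(X)

print("share of zeros:", (y == 0).mean())
print("OLS min prediction:", ols_pred.min())
print("Tweedie min prediction:", tw_pred.min())
print("OLS MAE (in-sample):", mean_absolute_error(y, ols_pred))
print("Tweedie MAE (in-sample):", mean_absolute_error(y, tw_pred))
```

With 1 < power < 2 the regressor uses a log link by default, so its predictions are strictly positive by construction, whereas nothing constrains the OLS predictions to stay above zero.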

Key takeaway

For data scientists modeling non-negative, zero-inflated outcomes such as insurance claims or customer spending, relying solely on OLS is suboptimal. Start by analyzing the distribution of your target variable. If it shows a high concentration of zeros and a skewed positive tail, consider a Tweedie regression or a two-step zero-inflated model. In this analysis the two-step model cut MAE by about 22% relative to Tweedie alone (from 111.97 to 87.79), yielding more accurate and interpretable predictions, and both alternatives avoid the impossible negative values that OLS can produce.
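The relative improvements can be checked directly from the four MAE figures reported in the summary. The short calculation below uses only those numbers; the helper name relative_reduction is introduced here for illustration.

```python
# MAE values as reported in the summary above.
mae = {
    "OLS": 174.17,
    "OLS + interactions": 172.24,
    "Tweedie": 111.97,
    "Two-step zero-inflated": 87.79,
}

def relative_reduction(baseline, improved):
    """Fractional drop in MAE when moving from baseline to improved."""
    return (baseline - improved) / baseline

two_step_vs_tweedie = relative_reduction(mae["Tweedie"], mae["Two-step zero-inflated"])
interactions_vs_ols = relative_reduction(mae["OLS"], mae["OLS + interactions"])
tweedie_vs_ols = relative_reduction(mae["OLS"], mae["Tweedie"])
two_step_vs_ols = relative_reduction(mae["OLS"], mae["Two-step zero-inflated"])

print(f"Interactions vs OLS: {interactions_vs_ols:.1%}")
print(f"Tweedie vs OLS:      {tweedie_vs_ols:.1%}")
print(f"Two-step vs OLS:     {two_step_vs_ols:.1%}")
print(f"Two-step vs Tweedie: {two_step_vs_tweedie:.1%}")
```

The two-step model's gain over Tweedie alone works out to about 21.6%, and its gain over plain OLS to just under 50%.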

Key insights

For zero-inflated, non-negative, and skewed target variables, standard OLS is insufficient; specialized models like Tweedie or zero-inflated approaches are superior.
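A quick way to check whether a target fits this description is to measure the share of exact zeros, the minimum, and the skewness of the positive part. The helper below is a minimal sketch using NumPy only; describe_target and the toy values are hypothetical and not taken from the article, and the helper assumes at least two distinct positive values.

```python
import numpy as np

def describe_target(y):
    """Summarize zero inflation and skew of a non-negative target."""
    y = np.asarray(y, dtype=float)
    positive = y[y > 0]
    centered = positive - positive.mean()
    # Population skewness of the positive part: third central moment
    # divided by the variance raised to the power 1.5.
    skew = (centered ** 3).mean() / (centered ** 2).mean() ** 1.5
    return {
        "zero_share": float((y == 0).mean()),
        "min": float(y.min()),
        "positive_skewness": float(skew),
    }

# Toy target: seven zeros and three positive amounts with one large value.
toy = np.array([0, 0, 0, 0, 0, 0, 0, 120.0, 300.0, 4500.0])
summary = describe_target(toy)
print(summary)
```

A large zero_share together with a clearly positive skewness is the pattern that points toward Tweedie or a zero-inflated model rather than plain OLS.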

Principles

OLS assumes linear relationships and normally distributed errors.
Interaction terms model feature dependencies.
Tweedie handles zero-inflation and skewed positive values.

Method

The zero-inflated model is a two-step process. First, a LightGBM classifier predicts the probability that a claim occurs. Second, a Tweedie regressor estimates the claim amount for positive claims. The final prediction is the product of the two: predicted claim probability times predicted claim amount.
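The two-step structure can be sketched as follows. To avoid an extra dependency, this example swaps in scikit-learn's GradientBoostingClassifier for the LightGBM classifier; lightgbm's LGBMClassifier offers the same fit and predict_proba interface and could be dropped in. Training the regressor only on rows with a positive target, reusing power=1.76 and alpha=1.0 for it, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for LightGBM
from sklearn.linear_model import TweedieRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 4000
X = rng.normal(size=(n, 4))  # hypothetical standardized features

# Synthetic target: a claim occurs with some probability, and if it does,
# the amount is drawn from a right-skewed gamma distribution.
p_claim = 1.0 / (1.0 + np.exp(-(-2.0 + 1.0 * X[:, 0])))
has_claim = rng.random(n) < p_claim
severity = rng.gamma(shape=2.0, scale=np.exp(6.0 + 0.3 * X[:, 1]) / 2.0)
y = np.where(has_claim, severity, 0.0)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Step 1: classify whether any claim occurs.
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr > 0)

# Step 2: model the amount, fitted on rows with a positive target only.
pos = y_tr > 0
reg = TweedieRegressor(power=1.76, alpha=1.0, max_iter=1000).fit(X_tr[pos], y_tr[pos])

# Final prediction: P(claim) times the predicted amount given a claim.
p_hat = clf.predict_proba(X_te)[:, 1]
amount_hat = reg.predict(X_te)
y_hat = p_hat * amount_hat

print("two-step MAE on held-out synthetic data:", mean_absolute_error(y_te, y_hat))
```

Because the probability lies in [0, 1] and the Tweedie prediction is positive, the product can never be negative, and rows the classifier considers unlikely to claim are pulled toward zero.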

In practice

Use Tweedie for insurance claims or customer lifetime value.
Combine a classifier and regressor for zero-inflated targets.
Clip outliers and transform skewed data before modeling.
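For the last item, one common pattern is to cap extreme values at a high percentile and then apply a log transform. The summary does not say which variables the article clipped or at what thresholds, so the 99th-percentile cap, the log1p transform, and the exposure variable below are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical right-skewed, non-negative feature.
exposure = rng.lognormal(mean=3.0, sigma=1.0, size=1000)

# Cap values above the 99th percentile instead of dropping the rows.
upper = np.quantile(exposure, 0.99)
clipped = np.minimum(exposure, upper)

# log1p compresses the long right tail and is defined at zero.
logged = np.log1p(clipped)

print("max before clipping:", exposure.max())
print("max after clipping: ", clipped.max())
print("max after log1p:    ", logged.max())
```

In a real pipeline the cap should be computed on the training split only and then reused on validation and test data, so that no information leaks from held-out rows.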

Topics

Tweedie Regression
Zero-Inflated Models
Ordinary Least Squares
LightGBM
Insurance Claims Prediction
Regression Analysis
Machine Learning Models

Code references

gurezende/Zero-Inflated-Tweedie-Regression

Best for: Data Scientist, Machine Learning Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.