Revealing Geography-Driven Signals in Zone-Level Claim Frequency Models: An Empirical Study using Environmental and Visual Predictors

2025-02-14 · Source: stat.ML updates on arXiv.org · Field: Finance & Economics — Insurance & Risk Management, Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, extended

Summary

This study investigates how geographic information from alternative data sources can enhance Motor Third Party Liability (MTPL) claim frequency prediction in actuarial models, particularly when public datasets offer limited location identifiers. Researchers used the BeMTPL97 dataset, aggregating 163,212 policyholders into 583 unique postcodes for zone-level modeling. They incorporated environmental indicators from OpenStreetMap (2014) and CORINE Land Cover 2000, alongside 1995 orthoimagery from the Belgian National Geographic Institute. The predictive contributions of coordinates, environmental features, and image embeddings were evaluated across Generalized Linear Models (GLMs), regularized GLMs, and gradient-boosted trees (XGBoost), with raw imagery processed by Convolutional Neural Networks (CNNs). Results show that augmenting actuarial variables with constructed geographic data improves predictive accuracy, with linear and tree-based models benefiting most from combining coordinates with 5 km scale environmental features. Image embeddings, especially pretrained vision-transformer embeddings like Nomic-embed-vision-v1.5, improved accuracy and stability when structured environmental features were unavailable.

Key takeaway

For AI Scientists developing actuarial models for MTPL claim frequency, constructed geographic information is worth integrating. In this study, combining traditional actuarial variables with geographic coordinates and environmental features, particularly those extracted at a 5 km radius, improved predictive accuracy. When structured environmental data is unavailable, pretrained vision-transformer embeddings are a workable substitute: they improved both accuracy and stability in that setting.

Key insights

Geographic context derived from alternative data sources improves MTPL claim frequency prediction in actuarial models, even when the public dataset offers only limited location identifiers such as postcodes.

Principles

Geographic context enhances actuarial risk assessment.
The spatial scale at which environmental features are extracted determines how useful they are; here, the 5 km scale paired with coordinates helped most.
Pretrained vision transformers offer valuable proxies for geographic context.

Method

The study employs a zone-level modeling framework, aggregating policy data by postcode. It integrates environmental features from OpenStreetMap and CORINE Land Cover, and visual features from historical orthoimagery, using GLMs, regularized GLMs, XGBoost, and CNNs for claim frequency prediction.
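The aggregation step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the record fields (`postcode`, `claims`, `exposure`) and the toy values are assumptions made for the example and are not taken from BeMTPL97.

```python
from collections import defaultdict


def aggregate_by_postcode(policies):
    """Sum claims and exposure per postcode and compute observed claim frequency.

    `policies` is an iterable of dicts with hypothetical keys
    'postcode', 'claims' (count) and 'exposure' (policy-years).
    """
    totals = defaultdict(lambda: {"claims": 0, "exposure": 0.0, "policies": 0})
    for p in policies:
        zone = totals[p["postcode"]]
        zone["claims"] += p["claims"]
        zone["exposure"] += p["exposure"]
        zone["policies"] += 1

    result = {}
    for postcode, z in totals.items():
        # Frequency is claims per unit of exposure; undefined with zero exposure.
        freq = z["claims"] / z["exposure"] if z["exposure"] > 0 else None
        result[postcode] = {**z, "frequency": freq}
    return result


# Toy records, made up for illustration only.
policies = [
    {"postcode": 1000, "claims": 1, "exposure": 1.0},
    {"postcode": 1000, "claims": 0, "exposure": 0.5},
    {"postcode": 2000, "claims": 2, "exposure": 1.0},
]
zones = aggregate_by_postcode(policies)
```

Each zone then becomes one modeling row, with summed exposure available as a weight or offset in a frequency model.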

In practice

Combine coordinates with 5 km environmental features for GLMs.
Use Nomic-embed-vision-v1.5 embeddings when structured environmental data is scarce.
Apply multi-scale environmental features with XGBoost for better performance.
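As one way to picture the first and third points, the sketch below builds a zone feature row from a centroid's coordinates plus a count of map points within a given radius. It is an illustration under stated assumptions: treating the spatial scale as a great-circle radius around a zone centroid, using a simple point count as the environmental indicator, and the names `haversine_km`, `count_within_radius`, and `zone_features` are all hypothetical and do not come from the paper.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def count_within_radius(centroid, points, radius_km=5.0):
    """Count points lying within radius_km of the centroid."""
    lat0, lon0 = centroid
    return sum(
        1 for lat, lon in points if haversine_km(lat0, lon0, lat, lon) <= radius_km
    )


def zone_features(centroid, points, radii_km=(5.0,)):
    """Build a feature dict: coordinates plus one point count per radius.

    Passing several radii gives a multi-scale variant of the same feature.
    """
    lat, lon = centroid
    features = {"lat": lat, "lon": lon}
    for r in radii_km:
        features[f"poi_count_{r:g}km"] = count_within_radius(centroid, points, r)
    return features


# Toy example: a centroid and three hypothetical map points.
centroid = (50.85, 4.35)
points = [(50.85, 4.35), (50.86, 4.36), (51.22, 4.40)]
features = zone_features(centroid, points)
```

The resulting dict can be appended to the zone-level actuarial variables before fitting a GLM or XGBoost model; real environmental indicators from OpenStreetMap or CORINE Land Cover would replace the toy point count.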

Topics

Motor Third Party Liability
Claim Frequency Prediction
Geospatial Data
OpenStreetMap
Remote Sensing

Best for: AI Scientist, Research Scientist, Data Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.