Expected Goals, End-to-End: An MLOps Pipeline for Hockey on Databricks

· Source: Machine Learning on Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, medium

Summary

This article walks through an end-to-end MLOps pipeline for an Expected Goals (xG) model in hockey, implemented on Databricks. The pipeline starts by generating 50,000 synthetic shots, which gives a reproducible dataset with a known ground-truth goal probability and an overall goal rate of 9.42%. Raw shot data is transformed into features such as distance, angle, and shot type. Three models (logistic regression, random forest, and XGBoost) are trained and tracked with MLflow, and logistic regression becomes the champion because it has the lowest log-loss, 0.2753. The winning model is governed in Unity Catalog through semantic aliases such as "@champion", so promotion and rollback are done by re-pointing an alias. The model is deployed to a real-time serving endpoint configured for scale-to-zero and automatic inference logging. Finally, Lakehouse Monitoring detects data drift: in a simulated shift in shot distribution, the goal rate dropped to 2.46%, and the monitor gave granular insight into the source of the drift.
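The first two stages (synthetic shots with a known ground truth, then distance and angle features) can be sketched in plain Python. This is a minimal illustration, not the article's code: the net coordinates, the rink bounds, and the logistic coefficients are all assumptions chosen for readability, and they are not tuned to reproduce the 9.42% goal rate. The `log_loss` helper shows the metric used to rank the models.

```python
import math
import random

# Assumed net location, in feet, on a hypothetical rink grid.
GOAL_X, GOAL_Y = 89.0, 0.0


def shot_features(x, y):
    """Return (distance in feet, absolute angle in degrees) from a shot to the net."""
    dx, dy = GOAL_X - x, GOAL_Y - y
    distance = math.hypot(dx, dy)
    angle = abs(math.degrees(math.atan2(dy, dx)))
    return distance, angle


def true_goal_probability(distance, angle):
    """Assumed ground truth: closer, more central shots score more often."""
    logit = -0.5 - 0.045 * distance - 0.015 * angle  # made-up coefficients
    return 1.0 / (1.0 + math.exp(-logit))


def generate_shots(n, seed=42):
    """Draw n shots; a fixed seed makes the dataset reproducible."""
    rng = random.Random(seed)
    shots = []
    for _ in range(n):
        x = rng.uniform(25.0, 89.0)    # offensive zone only (assumption)
        y = rng.uniform(-42.5, 42.5)
        distance, angle = shot_features(x, y)
        p = true_goal_probability(distance, angle)
        shots.append({
            "distance": distance,
            "angle": angle,
            "p_true": p,
            "goal": int(rng.random() < p),  # Bernoulli draw from the known probability
        })
    return shots


def log_loss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)
```

Because every shot stores `p_true`, the log-loss of the true probabilities is available as a reference point: no fitted model should be expected to score meaningfully better than it on the same data.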

Key takeaway

If you are an MLOps engineer building production systems, prioritize infrastructure over model complexity. Focus on a governed model registry, automated serving endpoints with inference logging, and granular data drift monitoring. Together these keep models reliable as inputs change, and they let you identify which input distributions shifted instead of only noticing that something failed.
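To make "granular" concrete, here is a small sketch of a per-feature drift check. It uses the Population Stability Index (PSI) as a stand-in technique; it is not the Lakehouse Monitoring API, and it is not necessarily the metric that product computes. The 10-bin layout and the 0.2 alert threshold are common rules of thumb, assumed here for illustration.

```python
import math


def psi(baseline, current, bins=10, eps=1e-6):
    """Population Stability Index of one numeric feature between two samples."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clip out-of-range values
        # Floor empty bins at eps so the log ratio stays finite.
        return [max(c / len(values), eps) for c in counts]

    b, c = proportions(baseline), proportions(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))


def drift_report(baseline_rows, current_rows, features, threshold=0.2):
    """Score each feature separately so the drifting input can be named."""
    report = {}
    for name in features:
        score = psi([r[name] for r in baseline_rows],
                    [r[name] for r in current_rows])
        report[name] = {"psi": score, "drifted": score > threshold}
    return report
```

Reporting one score per feature is what separates "the goal rate fell" from "the distance distribution moved while angle stayed put".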

Key insights

Production-ready MLOps infrastructure, not model complexity, ensures reliable and governed machine learning deployments.

Principles

Method

Generate synthetic data, engineer features, train and log models with MLflow, register the champion in Unity Catalog under an alias, deploy it to a scale-to-zero serving endpoint, and monitor for data drift at a granular level.
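The champion-selection and alias steps can be illustrated with a toy, in-memory registry. Everything below is hypothetical: `AliasRegistry` and `pick_champion` are illustrative names, not part of MLflow or Unity Catalog. In MLflow itself, the comparable operation is `MlflowClient.set_registered_model_alias(name, alias, version)`, and consumers load the model through a URI of the form `models:/<name>@champion`.

```python
class AliasRegistry:
    """Toy stand-in for a model registry with aliases (not the Unity Catalog API)."""

    def __init__(self):
        self._aliases = {}   # alias -> current model version
        self._history = {}   # alias -> earlier versions, oldest first

    def set_alias(self, alias, version):
        """Point the alias at a version, remembering what it pointed at before."""
        if alias in self._aliases:
            self._history.setdefault(alias, []).append(self._aliases[alias])
        self._aliases[alias] = version

    def rollback(self, alias):
        """Re-point the alias at the version it held previously."""
        previous = self._history[alias].pop()
        self._aliases[alias] = previous
        return previous

    def resolve(self, alias):
        """Return the version that consumers of the alias would load."""
        return self._aliases[alias]


def pick_champion(log_losses):
    """Return the model name with the lowest log-loss (lower is better)."""
    return min(log_losses, key=log_losses.get)


registry = AliasRegistry()
registry.set_alias("champion", 1)   # first promotion
registry.set_alias("champion", 2)   # promote a newer version
registry.rollback("champion")       # alias points at version 1 again
```

The design point is that the serving side only ever refers to the alias, so promoting or rolling back a model changes no consumer code. In a real registry a rollback is usually just another alias assignment to an earlier version; the history list here exists only to keep the toy self-contained.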

Best for: MLOps Engineer, Machine Learning Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning on Medium.