BaseballIQ: How I Built an MLB Intelligence Platform With Data Engineering, ML, and AI

Source: Data Engineering on Medium · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics, Software Development & Engineering) · Depth: Intermediate, medium

Summary

BaseballIQ is a live MLB analytics platform that combines a production-grade data pipeline, a validated machine learning model, a custom UI, and Claude-powered scouting reports. The pipeline follows a medallion architecture: raw Statcast data lands in Bronze, is transformed in Silver (with DuckDB handling aggregations and feature engineering), and is published in Gold as analytics-ready summary tables. An XGBoost model predicts CSW rate (called strikes plus whiffs as a share of pitches), a key pitcher quality metric, and is validated with TimeSeriesSplit so that each fold is scored only on data later than the data it was trained on. The custom UI, described as having a "luxury front office editorial aesthetic", is intended to improve trust and readability. Anthropic's Claude generates the scouting reports by synthesizing pre-computed statistics and ML-driven signals such as concern flags; it does not perform the calculations itself, which keeps the reports auditable. The project is publicly available on GitHub and deployed on Railway.
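The Bronze, Silver, and Gold layering can be illustrated with a minimal sketch. It is not the author's pipeline: it uses Python's built-in `sqlite3` as a stand-in for DuckDB so that it runs without extra dependencies, and the table names, column names, sample rows, and the set of outcomes counted as CSW are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical pitch-level rows standing in for raw Statcast data:
# (pitcher, game_date, pitch outcome description).
RAW_PITCHES = [
    ("Pitcher A", "2024-04-01", "called_strike"),
    ("Pitcher A", "2024-04-01", "swinging_strike"),
    ("Pitcher A", "2024-04-01", "ball"),
    ("Pitcher A", "2024-04-01", "foul"),
    ("Pitcher B", "2024-04-01", "ball"),
    ("Pitcher B", "2024-04-01", "hit_into_play"),
    ("Pitcher B", "2024-04-01", "called_strike"),
    ("Pitcher B", "2024-04-01", "ball"),
]


def build_layers(rows):
    """Load raw rows and derive Silver and Gold tables from them."""
    con = sqlite3.connect(":memory:")

    # Bronze: raw data stored as received, with no transformation.
    con.execute(
        "CREATE TABLE bronze_pitches (pitcher TEXT, game_date TEXT, description TEXT)"
    )
    con.executemany("INSERT INTO bronze_pitches VALUES (?, ?, ?)", rows)

    # Silver: one row per pitch with a derived feature (assumed CSW definition).
    con.execute(
        """
        CREATE TABLE silver_pitches AS
        SELECT pitcher,
               game_date,
               CASE WHEN description IN ('called_strike', 'swinging_strike')
                    THEN 1 ELSE 0 END AS is_csw
        FROM bronze_pitches
        """
    )

    # Gold: analytics-ready summary, one row per pitcher.
    con.execute(
        """
        CREATE TABLE gold_pitcher_summary AS
        SELECT pitcher,
               COUNT(*) AS pitches,
               SUM(is_csw) AS csw_pitches,
               1.0 * SUM(is_csw) / COUNT(*) AS csw_rate
        FROM silver_pitches
        GROUP BY pitcher
        """
    )
    return con


def csw_by_pitcher(con):
    """Read the Gold table into a {pitcher: csw_rate} dictionary."""
    query = "SELECT pitcher, csw_rate FROM gold_pitcher_summary ORDER BY pitcher"
    return dict(con.execute(query).fetchall())


if __name__ == "__main__":
    print(csw_by_pitcher(build_layers(RAW_PITCHES)))
```

Keeping Bronze untouched means the Silver and Gold tables can be rebuilt at any time if the feature definitions change.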

Key takeaway

For data scientists and ML engineers building analytical platforms: prioritize robust data pipelines and sound validation. A project becomes more credible and more useful when it moves beyond descriptive analytics, using ML for predictive insights and LLMs for narrative synthesis. Invest in design and explainability so that domain experts can trust the outputs and act on them.
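One way to keep LLM-written narrative explainable is to pass the model only figures that were computed upstream and to tell it not to calculate anything new. The sketch below builds such a prompt. The `PitcherSummary` fields, the `concern_flags` rule, its threshold, and the prompt wording are hypothetical, and the call that would send the prompt to Claude is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class PitcherSummary:
    """Pre-computed values, e.g. read from a Gold table and a model output."""
    name: str
    csw_rate: float            # observed CSW rate
    predicted_csw_rate: float  # model-predicted CSW rate


def concern_flags(summary, drop_threshold=0.02):
    """Return plain-language flags derived by code, not by the LLM.

    The rule and the threshold are illustrative assumptions.
    """
    flags = []
    if summary.predicted_csw_rate < summary.csw_rate - drop_threshold:
        flags.append("predicted CSW rate is below the observed rate")
    return flags


def build_scouting_prompt(summary):
    """Assemble a prompt that exposes every number the report may cite."""
    flags = concern_flags(summary)
    lines = [
        "Write a short scouting report using ONLY the figures below.",
        "Do not compute or estimate any new statistics.",
        f"Pitcher: {summary.name}",
        f"Observed CSW rate: {summary.csw_rate:.1%}",
        f"Model-predicted CSW rate: {summary.predicted_csw_rate:.1%}",
        "Concern flags: " + ("; ".join(flags) if flags else "none"),
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_scouting_prompt(PitcherSummary("Pitcher A", 0.31, 0.27)))
```

Because every number in the prompt comes from a stored table or model output, a reviewer can trace each claim in the generated report back to its source.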

Key insights

A robust analytics platform integrates data engineering, ML, design, and LLMs to provide actionable insights beyond descriptive statistics.

Method

Implement a medallion architecture (Bronze, Silver, Gold) for data pipelines. Use DuckDB for efficient SQL transformations. Train XGBoost models with TimeSeriesSplit validation. Integrate LLMs to synthesize pre-computed ML outputs into narrative reports.
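The idea behind time-ordered validation can be shown without any ML library. The sketch below is a hand-rolled, pure-Python expanding-window splitter, intended to mirror the default fold layout of scikit-learn's `TimeSeriesSplit`; it is an illustration of the technique, not the article's training code, and the function name is hypothetical.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) pairs for time-ordered data.

    Rows are assumed to be sorted oldest to newest. Each fold trains on
    everything before its test window, so no future rows leak into training.
    """
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("Too few samples for the requested number of splits")

    # The test windows tile the end of the series; earlier rows seed training.
    first_test_start = n_samples - n_splits * test_size
    for fold in range(n_splits):
        start = first_test_start + fold * test_size
        train = list(range(start))
        test = list(range(start, start + test_size))
        yield train, test


if __name__ == "__main__":
    for train_idx, test_idx in expanding_window_splits(10, 3):
        print("train:", train_idx, "test:", test_idx)
```

A shuffled K-fold split would let a model train on later games and be scored on earlier ones, which tends to overstate accuracy on time-dependent data such as pitcher performance.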

Best for: Data Engineer, Machine Learning Engineer, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Data Engineering on Medium.