BaseballIQ: How I Built an MLB Intelligence Platform With Data Engineering, ML, and AI
Summary
BaseballIQ is a live MLB analytics platform that combines a production-grade data pipeline, a validated machine learning model, a custom UI, and Claude-powered scouting reports. It uses a medallion architecture that moves Statcast data through three layers: Bronze (raw), Silver (transformed with DuckDB for aggregations and features), and Gold (analytics-ready summary tables). An XGBoost model predicts CSW rate (called strikes plus whiffs), a key pitcher quality metric, and is validated with TimeSeriesSplit so that each fold is tested only on data later than its training data. A custom-designed UI with a "luxury front office editorial aesthetic" is meant to improve trust and readability. Anthropic's Claude generates auditable scouting reports by synthesizing pre-computed statistics and ML-driven insights such as concern flags; it does not perform calculations itself. The entire project is publicly available on GitHub and deployed on Railway.
Key takeaway
If you are a data scientist or ML engineer building an analytics platform, prioritize robust data pipelines and validation. A project gains credibility and utility when it moves beyond descriptive analytics, adding ML for predictive insights and LLMs for narrative synthesis. Focus on design and explainability so that domain experts trust your outputs and can act on them.
Key insights
A robust analytics platform integrates data engineering, ML, design, and LLMs to provide actionable insights beyond descriptive statistics.
Principles
- Validate ML models with TimeSeriesSplit for temporal data.
- Design aesthetics are crucial for user trust and information delivery.
- LLMs should narrate pre-computed data, not calculate it.
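The last principle can be illustrated with a minimal sketch of a prompt builder. Everything here is an assumption for illustration: the function name `build_scouting_prompt`, the field names, and the sample values are hypothetical and do not come from the BaseballIQ code. The point is only that the numbers are computed upstream and handed to the model as fixed facts.

```python
import json


def build_scouting_prompt(pitcher: str, stats: dict, flags: list) -> str:
    """Assemble a prompt that asks the model to narrate, not compute.

    All statistics and flags are expected to be pre-computed by the
    pipeline and the ML model before this function is called.
    """
    payload = json.dumps(
        {"pitcher": pitcher, "stats": stats, "concern_flags": flags},
        indent=2,
        sort_keys=True,
    )
    return (
        "You are writing a scouting report.\n"
        "Use ONLY the numbers in the JSON below. Do not calculate, estimate, "
        "or infer any statistic that is not listed.\n\n"
        f"{payload}\n"
    )


# Hypothetical values, for illustration only (not real player data).
prompt = build_scouting_prompt(
    "Example Pitcher",
    {"csw_rate": 0.31, "predicted_csw_rate": 0.28},
    ["predicted CSW rate below current CSW rate"],
)
print(prompt)
```

The resulting string would then be sent through whichever LLM client the project uses. Because every figure in the report can be traced back to the JSON payload, the output stays auditable.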
Method
Implement a medallion architecture (Bronze, Silver, Gold) for data pipelines. Use DuckDB for efficient SQL transformations. Train XGBoost models with TimeSeriesSplit validation. Integrate LLMs to synthesize pre-computed ML outputs into narrative reports.
In practice
- Use DuckDB for complex SQL on millions of rows.
- Apply TimeSeriesSplit for time-series ML validation.
- Craft custom CSS for a professional dashboard aesthetic.
Topics
- Data Engineering
- XGBoost
- LLM Applications
- Sports Analytics
- Streamlit
Best for: Data Engineer, Machine Learning Engineer, Data Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Data Engineering on Medium.