AI models are terrible at betting on soccer—especially xAI Grok
Summary
A new study by AI start-up General Reasoning, titled "KellyBench," found that leading AI models from Google, OpenAI, Anthropic, and xAI consistently lost money when betting on the 2023–24 Premier League soccer season. Eight top AI systems were tested in a virtual re-creation of the season, given historical data, and instructed to maximize returns and manage risk over a long period. Anthropic’s Claude Opus 4.6 performed best, with an average loss of 11 percent, while xAI’s Grok 4.20 went bankrupt in all three attempts. Google’s Gemini 3.1 Pro made a 34 percent profit on one attempt but also went bankrupt. The report highlights how AI struggles with real-world complexity and dynamic, long-horizon tasks, in contrast to its rapid progress on static tasks like software engineering.
Key takeaway
For AI Product Managers evaluating model deployment in dynamic, real-world environments, this study indicates that current frontier models may perform far worse than their benchmark scores suggest. To gauge practical utility and risk accurately, prioritize testing AI systems in long-horizon, adaptive scenarios rather than relying solely on static evaluations.
Key insights
Advanced AI models struggle with dynamic, real-world prediction tasks over extended periods, despite proficiency in static benchmarks.
Principles
- AI performance degrades in dynamic, long-horizon scenarios.
- Static benchmarks misrepresent real-world AI capabilities.
Method
AI agents were given historical soccer data and instructed to build predictive models. They then placed bets on match outcomes over a simulated Premier League season, which tested how well they adapted as the season progressed.
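The benchmark's name suggests the Kelly criterion, a standard rule for sizing bets, although this summary does not say how the agents actually sized theirs. As a rough illustration only, the Python sketch below shows Kelly staking applied to a bankroll over a sequence of bets. The function names, the `ruin_level` threshold, and the bet format are all assumptions made for this example, not details from the study.

```python
def kelly_fraction(p_win: float, decimal_odds: float) -> float:
    """Fraction of bankroll to stake under the Kelly criterion.

    p_win: the bettor's estimated probability that the bet wins.
    decimal_odds: total payout per unit staked, including the stake.
    """
    b = decimal_odds - 1.0  # net profit per unit staked on a win
    f = (b * p_win - (1.0 - p_win)) / b
    return max(f, 0.0)  # stake nothing when the estimated edge is negative


def simulate_bankroll(bets, bankroll=1.0, ruin_level=0.01):
    """Apply a sequence of (p_win, decimal_odds, won) bets to a bankroll.

    ruin_level is a hypothetical threshold below which the bettor is
    treated as bankrupt; the study's own definition is not given here.
    """
    for p_win, odds, won in bets:
        stake = bankroll * kelly_fraction(p_win, odds)
        bankroll += stake * (odds - 1.0) if won else -stake
        if bankroll <= ruin_level:
            return 0.0
    return bankroll
```

The stake depends entirely on the estimated win probability, so a model that overstates its edge will bet too much and can lose its bankroll quickly, which is one plausible way to end up bankrupt over a long season.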
In practice
- Evaluate AI systems using dynamic, long-horizon benchmarks.
- Recognize AI limitations in complex, real-world decision-making.
Topics
- Large Language Models
- Sports Betting
- AI Benchmarking
- Real-World AI Performance
- KellyBench Report
Best for: Research Scientist, AI Product Manager, AI Scientist, Director of AI/ML, Consultant
Editorial summary, takeaway, and curation by AIssential. Original article published by AI - Ars Technica.