AI models are terrible at betting on soccer—especially xAI Grok

· Source: AI - Ars Technica · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Intermediate, short

Summary

A new study by AI start-up General Reasoning, titled "KellyBench," found that leading AI models from Google, OpenAI, Anthropic, and xAI consistently lost money when betting on the 2023–24 Premier League soccer season. Eight top AI systems were placed in a virtual re-creation of the season, given historical data, and instructed to maximize returns while managing risk over the long run. Anthropic's Claude Opus 4.6 performed best, with an average loss of 11 percent, while xAI's Grok 4.20 went bankrupt in all three of its attempts. Google's Gemini 3.1 Pro made a 34 percent profit on one attempt but also went bankrupt in another. The report highlights how AI still struggles with real-world complexity and dynamic, long-horizon tasks, in contrast to its rapid progress on static tasks such as software engineering.

Key takeaway

For AI Product Managers evaluating model deployment in dynamic, real-world environments, this study indicates that current frontier models may perform far worse than their benchmark scores suggest. To gauge practical utility and risk accurately, prioritize testing AI systems on long-horizon, adaptive scenarios rather than relying solely on static evaluations.

Key insights

Advanced AI models struggle with dynamic, real-world prediction tasks over extended periods, despite proficiency in static benchmarks.

Principles

Method

AI agents were given historical soccer data and instructed to build predictive models. They then placed bets on match outcomes over a simulated Premier League season, which tested how well they adapted as the season unfolded.
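The benchmark's name points to the Kelly criterion, a standard rule for sizing bets to grow a bankroll over many rounds. This summary does not say how the models actually staked their bets, so the following is only an illustrative sketch of the general idea. The function names, the starting bankroll, and the example probabilities and odds are all hypothetical and not taken from the study.

```python
def kelly_fraction(p_win: float, decimal_odds: float) -> float:
    """Fraction of bankroll to stake under the Kelly criterion.

    p_win is the bettor's own estimated win probability (an assumption
    supplied by whatever predictive model is in use); decimal_odds is the
    bookmaker's payout per unit staked, including the stake itself.
    Returns 0.0 when the bet has no positive expected edge.
    """
    if not 0.0 <= p_win <= 1.0:
        raise ValueError("p_win must be in [0, 1]")
    if decimal_odds <= 1.0:
        raise ValueError("decimal_odds must exceed 1.0")
    b = decimal_odds - 1.0  # net winnings per unit staked
    f = (b * p_win - (1.0 - p_win)) / b
    return max(f, 0.0)


def simulate_bankroll(bets, start=100.0):
    """Apply Kelly-sized stakes to a sequence of settled bets.

    bets is an iterable of (p_win, decimal_odds, won) tuples.
    Returns the final bankroll; stops early if the bankroll hits zero.
    """
    bankroll = start
    for p_win, odds, won in bets:
        stake = bankroll * kelly_fraction(p_win, odds)
        if won:
            bankroll += stake * (odds - 1.0)
        else:
            bankroll -= stake
        if bankroll <= 0.0:
            break  # bankrupt: no further bets possible
    return bankroll


if __name__ == "__main__":
    # Hypothetical season fragment: (estimated p_win, decimal odds, outcome)
    season = [(0.6, 2.0, True), (0.55, 2.1, False), (0.4, 2.0, True)]
    print(round(simulate_bankroll(season), 2))
```

The stake is only as good as the probability estimate behind it: an overconfident `p_win` produces oversized bets, which is one way a bettor can lose heavily even while following a sound staking rule.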

Best for: Research Scientist, AI Product Manager, AI Scientist, Director of AI/ML, Consultant

Editorial summary, takeaway, and curation by AIssential. Original article published by AI - Ars Technica.