5 frontier AI models were asked to code bots to navigate a foggy maze with teleportals. 1st to the exit wins. Over 500 steps and you're eliminated. Gemini, ChatGPT, and Mimo bots never made it past round 8. Here's Claude's and Grok's bots playing Round 93.
Summary
Five frontier AI models (Gemini, ChatGPT, Mimo, Claude, and Grok) were each tasked with coding a bot to navigate a foggy maze containing teleportals. The first bot to reach the exit wins the round, and any bot that takes more than 500 steps is eliminated. Each bot sees only a 5x5 window around its current position, so it has to build an internal "mental map" from partial observations. The Gemini, ChatGPT, and Mimo bots never made it past Round 8, while Claude's and Grok's bots were still competing in Round 93. The setup offers a way to compare models on more than standard, fully observable maze solving.
Key takeaway
AI scientists evaluating model capabilities should consider designing competitive coding challenges that combine partial observability with non-local mechanics such as teleportals. This moves beyond standard, fully observable problems and gives a more robust assessment of a model's ability to infer, plan, and adapt under uncertainty, which is crucial for developing more sophisticated autonomous agents.
Key insights
AI models can be compared by tasking them to code bots for complex, partially observable navigation challenges.
Principles
- Partial observation forces the bot to infer the unseen parts of the maze.
- Teleportals add complexity to pathfinding.
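To illustrate why teleportals complicate pathfinding, here is a minimal sketch of breadth-first search in which stepping onto a portal cell lands the bot at the portal's destination. The article does not describe the actual teleport rules or map encoding, so the `'#'` wall character, the `portals` dictionary, and the `shortest_steps` function are all assumptions made for illustration.

```python
from collections import deque


def shortest_steps(grid, start, goal, portals):
    """Return the BFS step count from start to goal, or None if unreachable.

    grid: list of equal-length strings; '#' marks a wall (assumed encoding).
    portals: dict mapping a portal cell to its destination cell (assumed
    mechanic; list both directions for a two-way portal).
    """
    rows, cols = len(grid), len(grid[0])
    dist = {start: 0}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            return dist[cell]
        r, c = cell
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue
            if grid[nr][nc] == '#':
                continue
            # Assumption: stepping onto a portal costs one step and
            # immediately places the bot at the portal's destination.
            nxt = portals.get((nr, nc), (nr, nc))
            if nxt not in dist:
                dist[nxt] = dist[cell] + 1
                queue.append(nxt)
    return None


# A wall column splits the maze; only the portal connects the two halves.
grid = ["..#..",
        "..#..",
        "..#.."]
print(shortest_steps(grid, (0, 0), (2, 4), {}))
print(shortest_steps(grid, (0, 0), (2, 4), {(2, 0): (2, 3)}))
```

Under these assumptions a portal is just an extra graph edge, so straight-line distance to the exit stops being a reliable guide: the shortest route may begin by walking away from the exit toward a portal.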
Method
Bots navigate a foggy maze with teleportals, building a mental map from a 5x5 observation window, aiming for the exit within 500 steps.
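The method above can be sketched roughly as follows: merge each 5x5 observation into a growing map, then walk toward the exit if it has been seen, or otherwise toward the nearest known open cell that borders unseen territory. This is not any model's actual bot. The observation format (five strings of five characters centred on the bot), the symbols `'#'`, `'.'`, and `'E'`, and the names `merge_window` and `next_target_path` are assumptions for illustration, and teleportals are left out to keep the sketch short.

```python
from collections import deque

PASSABLE = {'.', 'E'}  # assumed symbols: open floor and exit


def merge_window(known, pos, window):
    """Copy a 5x5 observation centred on pos into the known-map dict."""
    r0, c0 = pos
    for dr in range(-2, 3):
        for dc in range(-2, 3):
            known[(r0 + dr, c0 + dc)] = window[dr + 2][dc + 2]


def next_target_path(known, pos):
    """Return a list of cells from pos to the exit if it is known and
    reachable, otherwise to the nearest frontier cell; None if neither."""
    parent = {pos: None}
    queue = deque([pos])
    exit_cell = None
    frontier = None
    while queue:
        cell = queue.popleft()
        r, c = cell
        neighbours = [(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)]
        if known.get(cell) == 'E' and exit_cell is None:
            exit_cell = cell
        # A frontier cell is a reachable open cell next to an unseen cell.
        if frontier is None and any(n not in known for n in neighbours):
            frontier = cell
        for n in neighbours:
            if n not in parent and known.get(n) in PASSABLE:
                parent[n] = cell
                queue.append(n)
    target = exit_cell if exit_cell is not None else frontier
    if target is None:
        return None
    path = []
    while target is not None:
        path.append(target)
        target = parent[target]
    return path[::-1]


known = {}
merge_window(known, (2, 2), ["#####",
                             "#...#",
                             "#....",
                             "#...#",
                             "#####"])
print(next_target_path(known, (2, 2)))
```

A real bot would repeat this observe, merge, and plan cycle every step, and would also need to count its moves against the 500-step limit.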
In practice
- Design tasks with partial information.
- Introduce non-local movement mechanics, such as teleportals.
Topics
- Frontier AI Models
- Maze Solving
- Partial Observability
- AI Benchmarking
- Large Language Models
Best for: Machine Learning Engineer, AI Scientist, Research Scientist, AI Researcher, AI Engineer, Prompt Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.