NEW Grok 4.3 TESTED: Needs Multiple Iterations
Summary
A live testing session evaluated the new Grok 4.3 AI model against Ernie 5.1 Preview using a complex elevator logic puzzle. The custom test, designed to assess pure reasoning power without external API calls or "AI harness" protocols, required reaching floor 15 in fewer than 20 button presses while managing resources such as energy and tokens and navigating various "modes" and "code cards." On the first attempt, Grok 4.3 failed to find a valid sequence, citing ambiguities in the rules, while Ernie 5.1 struggled significantly. On the second attempt, Grok 4.3 produced a solution with 11 button presses plus an emergency exit, which was then validated. A third, optimized run reduced the sequence to 8 button presses plus an emergency exit, meeting the performance standard expected of modern AI models.
Key takeaway
AI engineers evaluating new large language models should prioritize custom, multi-attempt testing over corporate benchmarks to uncover true reasoning capabilities. Initial test runs may reveal rule-interpretation issues or suboptimal solutions, so iterative refinement and explicit optimization prompts may be needed to reach peak performance, as Grok 4.3's improvement from an initial failure to an 8-press solution demonstrates.
Key insights
Rigorous, iterative testing reveals an AI model's true reasoning capabilities and optimization potential.
Principles
- Pure reasoning power requires isolating models from external aids.
- Iterative refinement can significantly improve AI model performance.
- Ambiguity in prompts can lead to initial task failure.
Method
A custom "elevator logic test" with resource constraints and complex rules is used to evaluate AI models. The process involves initial attempts, validation, and optimization runs to find the shortest sequence.
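The attempt, validate, optimize cycle described above can be sketched as a small loop. This is a minimal illustration, not the procedure used in the session: `run_attempts`, `ask_model`, and `is_valid` are hypothetical names, and the canned replies stand in for real model output.

```python
from typing import Callable, List, Optional


def run_attempts(
    ask_model: Callable[[str], List[str]],
    is_valid: Callable[[List[str]], bool],
    base_prompt: str,
    max_attempts: int = 3,
) -> Optional[List[str]]:
    """Return the shortest valid sequence seen across attempts, or None."""
    best: Optional[List[str]] = None
    prompt = base_prompt
    for _ in range(max_attempts):
        candidate = ask_model(prompt)
        # Keep a candidate only if it passes validation and beats the current best.
        if is_valid(candidate) and (best is None or len(candidate) < len(best)):
            best = candidate
        if best is not None:
            # Once a valid answer exists, ask explicitly for a shorter one.
            prompt = (
                f"{base_prompt}\n"
                f"Find a valid sequence shorter than {len(best)} presses."
            )
    return best


# Canned replies standing in for a model: one invalid answer, then two valid ones.
replies = iter([[], ["UP"] * 11, ["UP"] * 8])
best = run_attempts(
    ask_model=lambda prompt: next(replies),
    is_valid=lambda seq: 0 < len(seq) < 20,  # placeholder check, not the real rules
    base_prompt="Solve the elevator puzzle.",
)
print(len(best))  # 8
```

Keeping validation outside the model call means a failed or suboptimal first answer is recorded rather than trusted, and the optimization prompt is only issued once a valid baseline exists.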
In practice
- Design custom benchmarks for specific reasoning tasks.
- Iterate on prompts and model interactions to refine solutions.
- Validate AI outputs against all specified constraints.
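Validating an output against every constraint is easiest with a small simulator that replays the proposed sequence. The sketch below assumes a toy rule set: the button names, energy costs, starting floor, and energy budget are all hypothetical, since the actual puzzle rules (modes, tokens, code cards, emergency exit) are not reproduced here. Only the target floor of 15 and the limit of fewer than 20 presses come from the test description.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical buttons: name -> (floor change, energy cost).
BUTTONS: Dict[str, Tuple[int, int]] = {
    "UP1": (1, 1),
    "UP3": (3, 2),
    "DOWN1": (-1, 0),
}


@dataclass
class Result:
    ok: bool
    reason: str


def validate(
    presses: List[str],
    target: int = 15,       # from the test description
    max_presses: int = 19,  # "under 20 button presses"
    energy: int = 12,       # assumed budget, for illustration only
) -> Result:
    """Replay a press sequence from floor 0 and check every constraint."""
    if len(presses) > max_presses:
        return Result(False, "too many presses")
    floor = 0
    for i, name in enumerate(presses, start=1):
        if name not in BUTTONS:
            return Result(False, f"unknown button at press {i}")
        delta, cost = BUTTONS[name]
        if cost > energy:
            return Result(False, f"out of energy at press {i}")
        energy -= cost
        floor += delta
        if floor < 0:
            return Result(False, f"below ground at press {i}")
    if floor != target:
        return Result(False, f"ended on floor {floor}")
    return Result(True, "valid")


print(validate(["UP3"] * 5))   # Result(ok=True, reason='valid')
print(validate(["UP1"] * 15))  # Result(ok=False, reason='out of energy at press 13')
```

Returning a reason alongside the verdict makes it easier to tell a rule-interpretation error from a genuinely infeasible sequence when reviewing a model's answer.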
Topics
- Grok 4.3 Performance
- AI Model Benchmarking
- Complex Logic Puzzles
- Iterative Reasoning
- Constraint Satisfaction
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by Discover AI.