NEW Grok 4.3 TESTED: Needs Multiple Iterations

· Source: Discover AI · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Advanced, long

Summary

A live testing session evaluated the new Grok 4.3 model against Ernie 5.1 Preview using a complex elevator logic puzzle. The custom test was designed to assess pure reasoning power without external API calls or "AI harness" protocols. It required reaching floor 15 in fewer than 20 button presses while managing resources such as energy and tokens and navigating various "modes" and "code cards." On the first attempt, Grok 4.3 failed to find a valid sequence, citing ambiguities in the rules, while Ernie 5.1 struggled significantly. On the second attempt, Grok 4.3 produced a solution with 11 button presses plus an emergency exit, which was then validated. A third, optimized run reduced the sequence to 8 button presses plus an emergency exit, meeting the expected performance standard for modern AI models.

Key takeaway

AI engineers evaluating new large language models should prioritize custom, multi-attempt testing over corporate benchmarks to uncover true reasoning capabilities. Initial test runs may reveal rule-interpretation issues or suboptimal solutions, so expect to iterate and to prompt explicitly for optimization before a model reaches peak performance. Grok 4.3 illustrates this: it went from a failed first attempt to an 8-press solution by the third run.

Key insights

Rigorous, iterative testing reveals an AI model's true reasoning capabilities and optimization potential.

Principles

Method

A custom "elevator logic test" with resource constraints and complex rules is used to evaluate AI models. The process involves initial attempts, validation, and optimization runs to find the shortest sequence.
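
The attempt, validate, and optimize loop described above can be sketched in Python. The source does not publish the actual puzzle rules, so everything here is an assumption made for illustration: the button names, the energy budget, the starting floor, and the helper names `validate` and `best_attempt` are all hypothetical, and modes, tokens, code cards, and the emergency exit are left out. Only the target floor (15) and the press limit (fewer than 20) come from the summary.

```python
from dataclasses import dataclass

# From the summary: reach floor 15 in fewer than 20 presses.
TARGET_FLOOR = 15
MAX_PRESSES = 20

# Hypothetical values: the real rule set is not given in the source.
START_ENERGY = 12
# button -> (floor delta, energy cost); illustrative only
BUTTONS = {"U1": (1, 1), "U3": (3, 2), "U5": (5, 4), "D2": (-2, 1)}


@dataclass
class Verdict:
    valid: bool
    presses: int  # presses replayed before the verdict was reached
    reason: str


def validate(sequence, start_floor=1):
    """Replay a model-proposed button sequence and check each constraint."""
    if len(sequence) >= MAX_PRESSES:
        return Verdict(False, len(sequence), "too many presses")
    floor, energy = start_floor, START_ENERGY
    for i, button in enumerate(sequence, start=1):
        if button not in BUTTONS:
            return Verdict(False, i, f"unknown button {button!r}")
        delta, cost = BUTTONS[button]
        floor += delta
        energy -= cost
        if energy < 0:
            return Verdict(False, i, "out of energy")
        if floor < 1:
            return Verdict(False, i, "below ground floor")
    if floor != TARGET_FLOOR:
        return Verdict(False, len(sequence), f"ended on floor {floor}")
    return Verdict(True, len(sequence), "ok")


def best_attempt(attempts):
    """Return the shortest valid sequence across several runs, or None."""
    valid = [a for a in attempts if validate(a).valid]
    return min(valid, key=len) if valid else None


if __name__ == "__main__":
    runs = [
        ["U5", "U5", "U5"],                    # run 1: overshoots, invalid
        ["U3", "U3", "U3", "U3", "U1", "U1"],  # run 2: valid, 6 presses
        ["U5", "U5", "U3", "U1"],              # run 3: valid, 4 presses
    ]
    for run in runs:
        print(run, validate(run))
    print("best:", best_attempt(runs))
```

The three sample runs mirror the shape of the session (an invalid first answer, a valid second answer, and a shorter third answer) rather than the actual sequences Grok 4.3 produced. Replaying each answer in a deterministic checker keeps the validation step independent of the model's own claims about its solution.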

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by Discover AI.