NEW Grok 4.3 TESTED: Needs Multiple Iterations

2026-05-01 · Source: Discover AI · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Advanced, long

Summary

A live testing session evaluated the new Grok 4.3 model against Ernie 5.1 Preview using a complex elevator logic puzzle. The custom test, designed to assess pure reasoning power without external API calls or "AI harness" protocols, required reaching floor 15 in fewer than 20 button presses while managing resources such as energy and tokens and navigating various "modes" and "code cards." On the first attempt, Grok 4.3 failed to find a valid sequence, citing ambiguities in the rules, while Ernie 5.1 Preview struggled significantly. On the second attempt, Grok 4.3 produced a solution with 11 button presses plus an emergency exit, which was then validated. A third, optimized run reduced the sequence to 8 button presses plus an emergency exit, meeting the expected performance standard for modern AI models.

Key takeaway

AI engineers evaluating new large language models should prioritize custom, multi-attempt testing over corporate benchmarks to uncover true reasoning capabilities. Initial test runs may expose rule-interpretation problems or suboptimal solutions, so reaching peak performance can take iterative refinement and explicit optimization prompts, as shown by Grok 4.3's progression from an initial failure to an 8-press solution.

Key insights

Rigorous, iterative testing reveals an AI model's true reasoning capabilities and optimization potential.

Principles

Measuring pure reasoning power requires isolating models from external aids.
Iterative refinement can significantly improve AI model performance.
Ambiguity in prompts can lead to initial task failure.

Method

A custom "elevator logic test" with resource constraints and complex rules is used to evaluate AI models. The process involves initial attempts, validation, and optimization runs to find the shortest sequence.
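
The attempt, validate, optimize cycle described above can be sketched as a small loop. This is a minimal illustration, not the harness used in the session: `best_valid_sequence`, `propose`, and `is_valid` are hypothetical names, the model call is replaced by canned responses, and the canned lists only mirror the reported press counts (a failed first run, then 11, then 8), not the actual button sequences.

```python
from typing import Callable, List, Optional, Tuple

Presses = List[str]


def best_valid_sequence(
    propose: Callable[[int, Optional[Presses]], Presses],
    is_valid: Callable[[Presses], bool],
    max_attempts: int = 3,
) -> Tuple[Optional[Presses], List[Tuple[int, bool, int]]]:
    """Run several attempts and keep the shortest sequence that validates.

    `propose` stands in for a model call; it receives the attempt number and
    the best valid sequence so far, so a later prompt can ask for a shorter one.
    """
    best: Optional[Presses] = None
    log: List[Tuple[int, bool, int]] = []
    for attempt in range(1, max_attempts + 1):
        candidate = propose(attempt, best)
        ok = is_valid(candidate)
        log.append((attempt, ok, len(candidate)))
        # Only a validated, strictly shorter candidate replaces the current best.
        if ok and (best is None or len(candidate) < len(best)):
            best = candidate
    return best, log


# Canned stand-in for the model (assumption): empty list = no valid answer.
canned = {1: [], 2: ["press"] * 11, 3: ["press"] * 8}
best, log = best_valid_sequence(
    propose=lambda attempt, previous: canned[attempt],
    is_valid=lambda seq: 0 < len(seq) < 20,  # toy check: non-empty, fewer than 20 presses
)
```

Keeping the per-attempt log separate from the best result makes it easy to report how many iterations a model needed, which is the point of the multi-attempt comparison.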

In practice

Design custom benchmarks for specific reasoning tasks.
Iterate on prompts and model interactions to refine solutions.
Validate AI outputs against all specified constraints.
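
Validating an output against every constraint is easiest when the rules are encoded as a simulator that replays the proposed presses. The sketch below assumes a deliberately simplified rule set: the button names, floor deltas, energy costs, and starting values are all invented for illustration, since the summary does not give the real puzzle rules (which also involve tokens, modes, and code cards). Only the target of floor 15 and the limit of fewer than 20 presses come from the source.

```python
from typing import Dict, List, Tuple

# Hypothetical rule set (assumption): button -> (floor delta, energy cost).
BUTTONS: Dict[str, Tuple[int, int]] = {
    "UP1": (1, 1),
    "UP3": (3, 2),
    "DOWN1": (-1, 0),
}
START_FLOOR = 1      # assumption
TOP_FLOOR = 15       # assumption: the building has no floor above the target
START_ENERGY = 12    # assumption
TARGET_FLOOR = 15    # from the source
MAX_PRESSES = 19     # from the source: fewer than 20 button presses


def validate(presses: List[str]) -> Tuple[bool, str]:
    """Replay a press sequence and report the first violated constraint."""
    if len(presses) > MAX_PRESSES:
        return False, "too many presses"
    floor, energy = START_FLOOR, START_ENERGY
    for i, button in enumerate(presses, start=1):
        if button not in BUTTONS:
            return False, f"press {i}: unknown button {button!r}"
        delta, cost = BUTTONS[button]
        if cost > energy:
            return False, f"press {i}: not enough energy"
        floor += delta
        energy -= cost
        if not 1 <= floor <= TOP_FLOOR:
            return False, f"press {i}: floor {floor} out of range"
    if floor != TARGET_FLOOR:
        return False, f"ended on floor {floor}, not {TARGET_FLOOR}"
    return True, "ok"
```

Returning the first violated rule, rather than a bare pass or fail, gives a concrete message to feed back into the next prompt when a model's answer is rejected.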

Topics

Grok 4.3 Performance
AI Model Benchmarking
Complex Logic Puzzles
Iterative Reasoning
Constraint Satisfaction

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Director of AI/ML


Editorial summary, takeaway, and curation by AIssential. Original article published by Discover AI.