NEW GPT-5.4 Reasoning TEST
Summary
OpenAI's new GPT-5.4 model, released on March 5, 2026, was subjected to a "causal reasoning test" designed for scientific work. The test is an "elevator problem": the model must find the shortest path from floor 0 to floor 50 in fewer than 20 button presses, and any move that leaves the 0-50 range is illegal. The standard GPT-5.4 model, priced at $2.50 per million input tokens and $15 per million output tokens (one-twelfth of the Pro version's $180 output price), repeatedly failed this task. Across multiple attempts, including self-verification runs, the model could not reach floor 50, landed on the wrong floor (e.g., 46), or proposed illegal moves (e.g., to floor 53). It ultimately concluded that, without further clarification of the rules, no valid path exists, and it held to that conclusion even when prompted to proceed with its own default rule sets. The analyst plans to test the "high" version of GPT-5.4 next.
Key takeaway
AI engineers evaluating new large language models for scientific or constraint-based problem-solving should rigorously test base versions such as GPT-5.4 on specific, non-trivial reasoning tasks. Do not assume a base model can handle complex logical constraints or pathfinding without explicit rule clarification or a move to a higher-tier, more expensive version. An initial assessment should cover edge cases and adherence to implicit rules to avoid deployment failures.
Key insights
The standard GPT-5.4 model struggled with causal reasoning and constraint satisfaction, even on a simple mathematical puzzle.
Principles
- Model performance can vary significantly across versions (e.g., standard vs. high).
- Explicit rule clarification can be critical for model task execution.
Method
The model was evaluated with a "causal reasoning test": an elevator pathfinding problem in which it must get from floor 0 to floor 50 in fewer than 20 button presses without ever leaving floors 0-50.
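To make the shape of the task concrete, here is a minimal sketch of how such a puzzle can be solved mechanically with breadth-first search. The summary does not state which buttons the elevator actually has, so the button set below (HYPOTHETICAL_BUTTONS) and the function name shortest_presses are assumptions for illustration only, not the rules used in the original test. Only the floor range (0-50), the start and goal floors, and the "fewer than 20 presses" limit come from the description above.

```python
from collections import deque

# ASSUMPTION: the real buttons are not given in the summary.
# These three are placeholders chosen only to make the sketch runnable.
HYPOTHETICAL_BUTTONS = {
    "up7": lambda floor: floor + 7,
    "down3": lambda floor: floor - 3,
    "double": lambda floor: floor * 2,
}


def shortest_presses(start=0, goal=50, lo=0, hi=50, max_presses=19,
                     buttons=HYPOTHETICAL_BUTTONS):
    """Breadth-first search over floors.

    Returns the shortest list of button names that reaches `goal`,
    or None if no legal sequence of at most `max_presses` exists.
    """
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        floor, path = queue.popleft()
        if floor == goal:
            return path
        if len(path) >= max_presses:
            continue  # "fewer than 20 presses" means at most 19
        for name, move in buttons.items():
            nxt = move(floor)
            # Any move that leaves the lo..hi range is illegal and is skipped.
            if lo <= nxt <= hi and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [name]))
    return None


if __name__ == "__main__":
    print(shortest_presses())
```

Because breadth-first search explores all one-press sequences, then all two-press sequences, and so on, the first sequence that reaches the goal is a shortest one. A None result means the rule set as encoded genuinely admits no legal path, which is a different situation from a model failing to find one.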
In practice
- Test base models before assuming suitability for complex tasks.
- Consider higher-tier models for reasoning-intensive applications.
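One way to act on these points is to check every model-proposed answer mechanically rather than relying on the model's own self-verification, which also failed in this test. The sketch below replays a proposed button sequence and reports the failure modes described in the summary: an illegal floor, ending on the wrong floor, or too many presses. The button set and the name check_path are hypothetical, since the summary does not list the real buttons.

```python
# ASSUMPTION: placeholder buttons; the real rule set is not in the summary.
HYPOTHETICAL_BUTTONS = {
    "up7": lambda floor: floor + 7,
    "down3": lambda floor: floor - 3,
    "double": lambda floor: floor * 2,
}


def check_path(presses, buttons=HYPOTHETICAL_BUTTONS, start=0, goal=50,
               lo=0, hi=50, max_presses=19):
    """Replay a proposed sequence; return (is_valid, reason)."""
    if len(presses) > max_presses:
        return False, f"too many presses: {len(presses)}"
    floor = start
    for step, name in enumerate(presses, start=1):
        if name not in buttons:
            return False, f"unknown button {name!r} at press {step}"
        floor = buttons[name](floor)
        if not lo <= floor <= hi:
            return False, f"illegal floor {floor} at press {step}"
    if floor != goal:
        return False, f"ended on floor {floor}, not {goal}"
    return True, "ok"


if __name__ == "__main__":
    # A sequence a model might propose, written as button names.
    print(check_path(["up7", "up7", "double", "down3", "double"]))
    print(check_path(["up7", "up7", "double", "double"]))
```

Under these placeholder rules the first sequence visits floors 7, 14, 28, 25, 50 and passes, while the second is rejected because its last press would go to floor 56. Running the same check against each model tier gives a pass/fail record that does not depend on the model's own account of its answer.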
Topics
- GPT-5.4
- AI Model Evaluation
- Causal Reasoning
- Large Language Models
- Model Performance
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Discover AI.