When AI goes haywire: the case of the skyscraper and the slide trombone

2026-02-08 · Source: Artificial intelligence (AI) – The Conversation · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Intermediate, medium

Summary

Generative AI, exemplified by models such as ChatGPT, Gemini, and Mistral, has seen unprecedented adoption since ChatGPT launched in November 2022; ChatGPT alone reportedly reaches 800 million weekly active users. Although these models can handle complex tasks such as passing bar exams or interpreting medical scans, they lack common sense and any grounded understanding of the physical world. In experiments that prompt a model to draw objects of vastly different sizes side by side, such as a skyscraper and a trombone, the results are consistently illogical: the objects appear at similar scales. The limitation stems from statistical learning. Diffusion models are trained on image-text pairs, so they have no internal representation of a concept like "compare" and no notion of the relative dimensions of objects that rarely co-occur in their training data. Because the models rely on statistical inference rather than logical reasoning, they produce "glitches" that expose their inability to grasp real-world context.

Key takeaway

If you are an AI scientist evaluating generative models, recognize that current systems such as Gemini and Mistral, despite their advanced capabilities, lack common sense and a logical world model. Your evaluation should include tests that push beyond learned patterns, such as comparisons of unrelated objects or complex logical queries, to expose limits in contextual understanding and to catch "off the mark" outputs before they reach real-world applications.

Key insights

Generative AI lacks common sense and real-world understanding, relying solely on statistical patterns from training data.

Principles

AI results are based on learned data patterns.
Models lack an internal representation of concepts such as "compare" or relative size.
Statistical inference can lead to logical glitches.

Method

Diffusion models, trained on image-text pairs, generate images by reversing a noise-addition process. They struggle with novel object comparisons because they lack contextual understanding.
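To make the noise-addition idea concrete, here is a minimal sketch of the forward noising step and its inversion, using the closed form commonly associated with DDPM-style diffusion. The article does not specify a formulation, so the linear schedule, the step count, the four-value toy "image", and all function names are assumptions for illustration. A real model would learn to predict the noise with a neural network conditioned on the text prompt; the sketch substitutes the true noise for that prediction.

```python
import math
import random


def linear_beta_schedule(steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances (a linear schedule is an assumed choice)."""
    if steps == 1:
        return [beta_start]
    span = beta_end - beta_start
    return [beta_start + span * t / (steps - 1) for t in range(steps)]


def cumulative_alpha(betas):
    """Running product of (1 - beta_t): the share of signal variance kept."""
    kept, out = 1.0, []
    for beta in betas:
        kept *= 1.0 - beta
        out.append(kept)
    return out


def add_noise(pixels, alpha_bar, rng):
    """Forward process: jump directly from clean pixels to a noisy version."""
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [
        math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
        for x, e in zip(pixels, noise)
    ]
    return noisy, noise


def recover_clean(noisy, predicted_noise, alpha_bar):
    """Invert the forward step given a noise prediction.

    Here the prediction is the true noise (a stand-in for a trained network).
    """
    return [
        (y - math.sqrt(1.0 - alpha_bar) * e) / math.sqrt(alpha_bar)
        for y, e in zip(noisy, predicted_noise)
    ]


rng = random.Random(0)                      # fixed seed for repeatability
alpha_bars = cumulative_alpha(linear_beta_schedule(1000))
image = [0.1, 0.5, 0.9, -0.3]               # toy "image" of four pixel values
noisy, noise = add_noise(image, alpha_bars[499], rng)
restored = recover_clean(noisy, noise, alpha_bars[499])
```

Nothing in this procedure encodes how large a skyscraper is relative to a trombone. Whatever the model knows about scale has to come from the image-text pairs it was trained on.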

In practice

Test AI with prompts combining disparate objects.
Verify AI-generated content for logical consistency.
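The two practices above can be sketched as a small test harness: one helper builds prompts that pair objects of very different real-world sizes, and another checks whether the size ratio observed in a generated image is anywhere near the real one. The object sizes are rough illustrative values rather than figures from the article, and the names `build_scale_prompts` and `is_scale_plausible` are hypothetical. Measuring the depicted ratio (by hand or with an object detector) is left outside the sketch.

```python
from itertools import combinations

# Rough real-world sizes in metres (assumed, illustrative values).
APPROX_SIZE_M = {
    "skyscraper": 300.0,
    "slide trombone": 1.2,
    "house cat": 0.5,
    "city bus": 12.0,
}


def build_scale_prompts(sizes, min_ratio=20.0):
    """Return (prompt, bigger, smaller, real_ratio) for strongly mismatched pairs."""
    cases = []
    for a, b in combinations(sorted(sizes), 2):
        big, small = (a, b) if sizes[a] >= sizes[b] else (b, a)
        ratio = sizes[big] / sizes[small]
        if ratio >= min_ratio:
            prompt = f"A {big} and a {small} side by side, drawn to scale"
            cases.append((prompt, big, small, ratio))
    return cases


def is_scale_plausible(real_ratio, depicted_ratio, tolerance=3.0):
    """True if the depicted size ratio is within a factor of `tolerance` of reality."""
    return real_ratio / tolerance <= depicted_ratio <= real_ratio * tolerance


cases = build_scale_prompts(APPROX_SIZE_M)
for prompt, big, small, ratio in cases:
    print(f"{prompt}  (real ratio about {ratio:.0f}:1)")
```

An image in which the skyscraper and the trombone appear nearly the same height would give a depicted ratio close to 1, which `is_scale_plausible` rejects for a real ratio of roughly 250.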

Topics

Generative AI
AI Limitations
Diffusion Models
Large Language Models
AI Training Data

Best for: Computer Vision Engineer, AI Scientist, Research Scientist, AI Engineer, Data Scientist, Tech Journalist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial intelligence (AI) – The Conversation.