GPT-5.2 Can't Identify a Serial Killer & Was The Year of Agents A Lie? EP99.28-5.2
Summary
OpenAI's GPT-5.2, released with a 400k context window and 128k output, is priced at $1.75 per million input tokens, a 25-cent increase. Initial impressions suggest it is verbose and "vibe-tuned" for benchmarks, with no noticeable improvement over its predecessor, GPT-5.1. The model struggles with chaining tool calls and self-correction, unlike Anthropic's Claude Opus and Google's Gemini 3 Pro, which demonstrate superior agentic capabilities and more sensible responses. A "vision test" revealed GPT-5.2's reluctance to make judgments, even when presented with explicit textual evidence of a convicted serial killer, a task successfully handled by Claude and Gemini. This perceived over-tuning for safety and verbosity is leading real users, like a barber, to switch to alternatives such as Grok 4.1, highlighting a potential brand damage for OpenAI.
Key takeaway
For CTOs and VPs of Engineering evaluating AI models for enterprise deployment, GPT-5.2's performance issues in tool calling and judgment, coupled with its higher cost, suggest caution. Your teams should prioritize models like Claude Opus or Gemini 3 Pro, which demonstrate more robust agentic capabilities and practical reasoning. This shift could prevent workflow bottlenecks and improve the reliability of AI-driven automation, especially for complex, multi-step tasks.
Key insights
GPT-5.2's verbose, over-tuned responses and agentic failures highlight a gap in practical AI application compared to competitors.
Principles
- Over-tuning for safety can degrade practical AI output.
- Agentic workflows require reliable tool chaining and self-correction.
Method
The "Ivan Milat Vision Test" involves presenting an image with explicit text (e.g., "serial killer") and asking the model to assess trustworthiness or make judgments, revealing its commitment to factual inference.
In practice
- Prioritize models like Claude Opus or Gemini 3 Pro for agentic tasks requiring reliable tool calling.
- Evaluate AI models beyond benchmarks by testing their ability to make practical judgments from explicit context.
Topics
- GPT-5.2 Performance
- AI Agentic Workflows
- Large Language Models
- Model Safety
- Enterprise AI Adoption
Best for: CTO, VP of Engineering/Data, Director of AI/ML, Machine Learning Engineer, AI Product Manager, AI Engineer
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by This Day in AI Podcast.