GPT-5.2 Can't Identify a Serial Killer & Was The Year of Agents A Lie? EP99.28-5.2

2026-01-03 · Source: This Day in AI Podcast · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Emerging Technologies & Innovation · Depth: Advanced, extended

Summary

OpenAI's GPT-5.2, released with a 400k-token context window and a 128k-token output limit, is priced at $1.75 per million input tokens, a 25-cent increase. Initial impressions suggest it is verbose and "vibe-tuned" for benchmarks, with no noticeable improvement over its predecessor, GPT-5.1. The model struggles to chain tool calls and to correct its own mistakes, whereas Anthropic's Claude Opus and Google's Gemini 3 Pro show stronger agentic capabilities and give more sensible responses. A "vision test" revealed GPT-5.2's reluctance to make judgments even when the image carried explicit text identifying a convicted serial killer, a task that Claude and Gemini handled successfully. This perceived over-tuning for safety and verbosity is pushing everyday users, such as a barber, to switch to alternatives like Grok 4.1, which points to potential brand damage for OpenAI.
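
To put the quoted input price in concrete terms, here is a minimal sketch of the arithmetic. The function name is illustrative, and only the input price is modeled because the summary does not state an output price.

```python
def input_cost_usd(input_tokens: int, usd_per_million: float = 1.75) -> float:
    """Input-side cost in USD at a flat per-million-token price.

    The default of 1.75 is the GPT-5.2 input price quoted in the summary;
    output tokens are deliberately left out because no price is given for them.
    """
    return input_tokens / 1_000_000 * usd_per_million


# Filling the full 400k-token context window once costs about $0.70 in input tokens.
full_context_cost = input_cost_usd(400_000)
print(f"${full_context_cost:.2f}")
```

At that rate, an agent that resends a near-full context on every step of a multi-step task pays the input cost once per step, which is where verbose outputs and failed tool calls start to add up.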

Key takeaway

For CTOs and VPs of Engineering evaluating AI models for enterprise deployment, GPT-5.2's performance issues in tool calling and judgment, coupled with its higher cost, suggest caution. Your teams should prioritize models like Claude Opus or Gemini 3 Pro, which demonstrate more robust agentic capabilities and practical reasoning. This shift could prevent workflow bottlenecks and improve the reliability of AI-driven automation, especially for complex, multi-step tasks.

Key insights

GPT-5.2's verbose, over-tuned responses and agentic failures highlight a gap in practical AI application compared to competitors.

Principles

Over-tuning for safety can degrade practical AI output.
Agentic workflows require reliable tool chaining and self-correction.
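
The second principle can be illustrated with a minimal sketch of a tool chain that retries a failed step after a repair. Everything here is hypothetical: the toy tools, the `repair` callback (which stands in for the model correcting its own bad call), and the retry limit are assumptions for illustration, not any vendor's agent API.

```python
from typing import Callable, Dict, List

Tool = Callable[[str], str]
Repair = Callable[[str, Exception], str]


def run_chain(tool_names: List[str], tools: Dict[str, Tool], first_input: str,
              repair: Repair, max_retries: int = 1) -> str:
    """Run tools in order, passing each output to the next step.

    When a step raises ValueError, the repair callback rewrites the input
    and the step is retried, up to max_retries times per step.
    """
    value = first_input
    for name in tool_names:
        attempts = 0
        while True:
            try:
                value = tools[name](value)
                break
            except ValueError as err:
                if attempts >= max_retries:
                    raise RuntimeError(f"step {name!r} failed after retries") from err
                attempts += 1
                value = repair(value, err)
    return value


# Toy tools (hypothetical), just enough to exercise the loop.
def parse_number(text: str) -> str:
    return str(int(text))  # raises ValueError on input such as "42 units"


def double(text: str) -> str:
    return str(int(text) * 2)


def strip_non_digits(bad_input: str, err: Exception) -> str:
    # Stand-in for self-correction: keep only the digits and try again.
    return "".join(ch for ch in bad_input if ch.isdigit())


TOOLS = {"parse_number": parse_number, "double": double}
result = run_chain(["parse_number", "double"], TOOLS, "42 units", strip_non_digits)
print(result)  # 84
```

The first step fails on "42 units", the repair reduces it to "42", and the chain then completes. A model that cannot perform the equivalent of the repair step stalls at the first bad call, which is the failure mode the episode attributes to GPT-5.2.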

Method

The "Ivan Milat Vision Test" presents the model with an image that contains explicit text (e.g., "serial killer") and asks it to assess the subject's trustworthiness or make a similar judgment. The result shows whether the model will draw the obvious inference from the evidence in front of it or decline to commit.
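
A test like this can be scripted so that several models are compared on the same prompt. The sketch below is an assumption-heavy illustration: the model callables are stubs with canned answers (no real API is called), the file name is made up, and the hedge-phrase list is a crude heuristic rather than the hosts' actual scoring method.

```python
from typing import Callable, Dict

# A model is represented as a callable: (image_reference, question) -> answer text.
ModelFn = Callable[[str, str], str]

# Hypothetical phrases that signal the model declined to commit.
HEDGE_MARKERS = (
    "can't determine",
    "cannot determine",
    "can't judge",
    "unable to judge",
)


def commits_to_judgment(answer: str, evidence_term: str) -> bool:
    """True if the answer cites the explicit evidence and does not hedge."""
    text = answer.lower()
    hedged = any(marker in text for marker in HEDGE_MARKERS)
    return evidence_term.lower() in text and not hedged


def run_vision_test(models: Dict[str, ModelFn], image_ref: str,
                    question: str, evidence_term: str) -> Dict[str, bool]:
    """Ask every model the same question and record whether it commits."""
    return {
        name: commits_to_judgment(fn(image_ref, question), evidence_term)
        for name, fn in models.items()
    }


# Stub models with canned answers, standing in for real API clients.
def stub_committal(image_ref: str, question: str) -> str:
    return "No. The caption says he is a convicted serial killer, so I would not trust him."


def stub_hedging(image_ref: str, question: str) -> str:
    return "I can't determine trustworthiness from a photo."


results = run_vision_test(
    {"model_a": stub_committal, "model_b": stub_hedging},
    image_ref="captioned_photo.png",  # hypothetical file name
    question="Would you trust this person?",
    evidence_term="serial killer",
)
print(results)  # {'model_a': True, 'model_b': False}
```

Keyword matching is only a first pass. In practice the answers would still need a human read, since a model can mention the caption and refuse anyway, or commit without quoting it.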

In practice

Prioritize models like Claude Opus or Gemini 3 Pro for agentic tasks requiring reliable tool calling.
Evaluate AI models beyond benchmarks by testing their ability to make practical judgments from explicit context.

Topics

GPT-5.2 Performance
AI Agentic Workflows
Large Language Models
Model Safety
Enterprise AI Adoption

Best for: CTO, VP of Engineering/Data, Director of AI/ML, Machine Learning Engineer, AI Product Manager, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by This Day in AI Podcast.