OpenAI's new GPT-5.4 clobbers humans on pro-level work in tests - by 83%
Summary
OpenAI has released GPT-5.4, its "Thinking" model, less than three months after GPT-5.2. The new iteration scores 83% on OpenAI's GDPval evaluation, meaning graders judged its work to match or outperform that of human professionals in 83% of comparisons spanning 44 real-world occupations in nine industries. Compared with GPT-5.2, GPT-5.4's responses are 18% less likely to contain errors and 33% less likely to include false claims. The model is rolling out across ChatGPT paid tiers, the API, and the Codex programming tool. Key enhancements include improved tool use, better computer vision for interpreting complex images and documents, native computer-use capabilities for operating software via screenshots and commands, and stronger coding abilities that incorporate the strengths of GPT-5.3-Codex.
Key takeaway
For machine learning engineers evaluating frontier models for enterprise deployment, GPT-5.4's 83% professional performance score and reduced error rates point to gains in both reliability and capability over GPT-5.2. Prioritize testing its improved tool use, computer vision, and native computer-use features to identify specific high-value automation opportunities within your organization, particularly for complex, multi-step workflows.
Key insights
By OpenAI's own evaluation, GPT-5.4 performs at or above the level of human professionals on most graded tasks drawn from 44 occupations, a result measured on real-world work rather than on academic benchmarks.
Principles
- AI evaluation should reflect economically valuable, real-world tasks.
- Human expert grading is crucial for validating AI performance.
Method
OpenAI's GDPval test assesses AI performance by having human professionals create and grade complex, day-to-day tasks across 44 occupations in nine industries. Automated grading systems are then built from that human expert input.
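To make the headline number concrete, here is a minimal Python sketch of how a "matches or outperforms" score could be computed from pairwise grader verdicts. The summary does not give OpenAI's exact scoring formula, so treating the score as a win-or-tie rate is an assumption, and the function name, verdict labels, and sample data are all hypothetical.

```python
from collections import Counter

# Hypothetical verdict labels: how a grader rated the model's output
# relative to the human professional's output on the same task.
VALID_VERDICTS = {"win", "tie", "loss"}


def win_or_tie_rate(verdicts):
    """Return the share of comparisons the model won or tied.

    Assumption: the reported score is (wins + ties) / total comparisons.
    """
    unknown = set(verdicts) - VALID_VERDICTS
    if unknown:
        raise ValueError(f"unknown verdict labels: {sorted(unknown)}")
    if not verdicts:
        raise ValueError("no verdicts to score")
    counts = Counter(verdicts)
    return (counts["win"] + counts["tie"]) / len(verdicts)


# Illustrative data only, not real evaluation results.
sample = ["win", "loss", "tie", "win"]
print(f"{win_or_tie_rate(sample):.0%}")  # 3 of 4 comparisons won or tied
```

Under this reading, a score of 83% says the model's work was rated at least as good as the professional's in 83 of every 100 comparisons. It does not say the model is 83% better than humans.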
In practice
- Integrate GPT-5.4 via API for complex professional workflows.
- Utilize GPT-5.4's coding enhancements for software development.
- Explore GPT-5.4's computer vision for document parsing.
Topics
- GPT-5.4
- AI Performance Evaluation
- Professional AI Capabilities
- Large Language Models
- AI Job Impact
Best for: Executive, Machine Learning Engineer, NLP Engineer, AI Engineer, AI Product Manager, Tech Journalist
Editorial summary, takeaway, and curation by AIssential. Original article published by ZDNET.