New benchmark shows Claude Mythos and GPT-5.5 can develop real browser exploits autonomously
Summary
Researchers at Carnegie Mellon University developed ExploitBench, a new benchmark that evaluates how well AI agents can exploit real-world vulnerabilities in Google's V8 JavaScript engine. Anthropic's Claude Mythos Preview significantly outperformed OpenAI's GPT-5.5: with human hints, it achieved an average score of 9.90 out of 16 and reached the highest tier (arbitrary code execution) on 21 of 41 vulnerabilities. GPT-5.5 averaged 5.51 points and reached the top tier on only two. In fully autonomous mode, Mythos scored 9.55 points, while GPT-5.5 via Codex managed 4.30. The cost difference is substantial, however: a full Mythos test run cost approximately $36,428 for 122 episodes, compared with about $3,075 for 123 episodes with GPT-5.5 via Codex. A co-author noted that Mythos operates like a "fairly competent" browser security researcher, even developing a previously dismissed exploit technique and reproducing a year-old vulnerability.
Key takeaway
For CTOs and VPs of Engineering assessing AI for cybersecurity, the ExploitBench results highlight Claude Mythos's superior exploit development capabilities, which come at a significantly higher cost than GPT-5.5. Weigh the performance gains against the financial investment, especially for tasks that require autonomous arbitrary code execution. Consider whether the roughly twelve-fold cost difference justifies Mythos's stronger, but still limited, "researcher-like" performance in your specific security operations.
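The twelve-fold figure can be checked with a short calculation. The sketch below uses only the approximate totals and episode counts quoted in the summary; the dictionary layout and variable names are illustrative assumptions, not part of ExploitBench.

```python
# Approximate figures as quoted in the summary (not independently verified).
runs = {
    "Claude Mythos Preview": {"total_usd": 36_428, "episodes": 122},
    "GPT-5.5 via Codex": {"total_usd": 3_075, "episodes": 123},
}


def cost_per_episode(total_usd: float, episodes: int) -> float:
    """Average spend per benchmark episode, in US dollars."""
    return total_usd / episodes


# Per-episode cost for each model.
per_episode = {name: cost_per_episode(**run) for name, run in runs.items()}

# Ratio of the full-run totals, and ratio of the per-episode averages.
total_ratio = (
    runs["Claude Mythos Preview"]["total_usd"]
    / runs["GPT-5.5 via Codex"]["total_usd"]
)
per_episode_ratio = (
    per_episode["Claude Mythos Preview"] / per_episode["GPT-5.5 via Codex"]
)

for name, cost in per_episode.items():
    print(f"{name}: ${cost:,.2f} per episode")
print(f"Total cost ratio: {total_ratio:.2f}x")
print(f"Per-episode cost ratio: {per_episode_ratio:.2f}x")
```

On these numbers Mythos costs roughly $299 per episode against about $25 for GPT-5.5 via Codex, so the gap is close to twelve-fold whether measured per run or per episode.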
Key insights
AI agents such as Claude Mythos Preview can develop working exploits for real V8 vulnerabilities with little or no human help: its average score fell only from 9.90 with hints to 9.55 in fully autonomous mode.
Principles
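- AI agents can carry an exploit through to arbitrary code execution, the benchmark's highest tier.
- Exploit development performance varies widely across models: with hints, Mythos averaged 9.90 out of 16 against 5.51 for GPT-5.5.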
Method
ExploitBench scores an AI agent's progress across five tiers, from triggering the bug to arbitrary code execution, on 41 real-world vulnerabilities in the V8 JavaScript engine.
In practice
- Use ExploitBench to evaluate the exploit development capabilities of AI agents.
- Weigh cost against performance when choosing between advanced AI models.
Topics
- Claude Mythos
- GPT-5.5
- ExploitBench
- Browser Exploits
- V8 Engine Vulnerabilities
Best for: CTO, VP of Engineering/Data, Executive, AI Security Engineer, AI Scientist, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by The Decoder.