New benchmark shows Claude Mythos and GPT-5.5 can develop real browser exploits autonomously

2026-05-16 · Source: The Decoder · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Advanced, quick

Summary

Researchers at Carnegie Mellon University developed ExploitBench, a new benchmark to evaluate AI agents' ability to exploit real-world vulnerabilities in Google's V8 JavaScript engine. Anthropic's Claude Mythos Preview significantly outperformed OpenAI's GPT-5.5, achieving an average score of 9.90 out of 16 with human hints and reaching the highest tier (arbitrary code execution) on 21 of 41 vulnerabilities. GPT-5.5 scored 5.51 points, reaching the top tier on only two. In fully autonomous mode, Mythos scored 9.55 points, while GPT-5.5 via Codex managed 4.30. However, the cost difference is substantial: a full Mythos test run cost approximately $36,428 for 122 episodes, compared to about $3,075 for 123 episodes with GPT-5.5 via Codex. A co-author noted Mythos operates like a "fairly competent" browser security researcher, even developing a previously dismissed exploit technique and reproducing a year-old vulnerability.

Key takeaway

For CTOs and VPs of Engineering assessing AI for cybersecurity, the ExploitBench results highlight Claude Mythos's superior exploit development capabilities, albeit at a significantly higher cost than GPT-5.5. Weigh the performance gains against the financial investment, especially for tasks requiring autonomous arbitrary code execution. Consider whether the roughly twelve-fold cost difference (about $36,428 versus $3,075 per full test run) justifies Mythos's enhanced, but still limited, "researcher-like" performance in your specific security operations.
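The cost gap can be checked directly from the run totals reported in the summary. The sketch below is only an illustration: the dictionary layout and the function name are hypothetical, and the only inputs are the reported run costs and episode counts.

```python
# Reported figures from the summary; names and structure here are illustrative only.
runs = {
    "Claude Mythos Preview": {"cost_usd": 36_428, "episodes": 122},
    "GPT-5.5 via Codex": {"cost_usd": 3_075, "episodes": 123},
}


def cost_per_episode(run: dict) -> float:
    """Average cost of one episode in a full test run."""
    return run["cost_usd"] / run["episodes"]


per_episode = {name: cost_per_episode(run) for name, run in runs.items()}

# Ratio of total run costs and of per-episode costs (Mythos relative to GPT-5.5).
total_ratio = runs["Claude Mythos Preview"]["cost_usd"] / runs["GPT-5.5 via Codex"]["cost_usd"]
episode_ratio = per_episode["Claude Mythos Preview"] / per_episode["GPT-5.5 via Codex"]

for name, cost in per_episode.items():
    print(f"{name}: ${cost:,.2f} per episode")
print(f"Total cost ratio: {total_ratio:.1f}x, per-episode ratio: {episode_ratio:.1f}x")
```

On these figures, Mythos works out to roughly $299 per episode against $25 for GPT-5.5 via Codex, a ratio of about 11.9, which is consistent with the "twelve-fold" description.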

Key insights

AI agents like Claude Mythos can develop browser exploits autonomously: without human hints, Mythos averaged 9.55 out of 16 on ExploitBench, close to its hinted score of 9.90.

Principles

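AI agents can carry real V8 vulnerabilities through to arbitrary code execution.
Performance in exploit generation varies widely across models: with hints, Mythos reached the top tier on 21 of 41 vulnerabilities, GPT-5.5 on two.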

Method

ExploitBench measures AI agent progress across five tiers, from bug triggering to arbitrary code execution, using real-world V8 JavaScript engine vulnerabilities.

In practice

Use ExploitBench to evaluate AI agent exploit capabilities.
Consider cost-performance trade-offs for advanced AI models.

Topics

Claude Mythos
GPT-5.5
ExploitBench
Browser Exploits
V8 Engine Vulnerabilities

Code references

exploitbench/exploitbench

Best for: CTO, VP of Engineering/Data, Executive, AI Security Engineer, AI Scientist, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by The Decoder.