A Red-Team Study of Anthropic Fable 5 & Opus 4.8 Models

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, quick

Summary

A red-team study evaluated the adversarial robustness of Anthropic's frontier large language models, Fable 5 and Opus 4.8. Using the HackAgent framework, researchers tested the models against four families of automated jailbreak attacks across 7,826 harmful intents within a ten-category harm taxonomy. While both models resisted most attacks, the study found that aggregate resistance rates are misleading. Adaptive iterative attacks proved highly effective, breaking Opus 4.8 on 11.5% of intents and Fable 5 on 6.1%. Opus 4.8 produced 1,620 and Fable 5 produced 702 panel-confirmed harmful completions, located automatically and cheaply, demonstrating that even hardened frontier models remain reliably breakable under sustained automated pressure.

Key takeaway

For AI Security Engineers evaluating LLM deployments or AI Scientists developing robust models, you should not rely solely on aggregate jailbreak resistance metrics. Frontier models like Anthropic's Fable 5 and Opus 4.8 remain reliably breakable by automated, adaptive attacks. Prioritize robust red-teaming with adaptive iterative attack methods and automated adjudication to uncover persistent vulnerabilities, rather than assuming high aggregate resistance implies safety.

Key insights

Frontier LLMs, despite hardening, remain reliably vulnerable to automated, adaptive jailbreak attacks.

Principles

Method

The HackAgent red-teaming framework generates adversarial attempts, with three judge models independently adjudicating successes via majority vote. This process identifies harmful completions across a ten-category harm taxonomy.

In practice

Topics

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Security Engineer, AI Scientist, Machine Learning Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.