What happened after 2,000 people tried to hack my AI assistant

· Source: Simon Willison's Weblog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Intermediate, quick

Summary

Fernando Irarrázaval conducted a challenge on hackmyclaw.com, inviting 2,000 participants to attempt leaking secrets from his OpenClaw AI assistant via email. Despite 6,000 attempts, \$500 in token spend, and a Google account suspension due to email volume, no participant successfully extracted the secret. The AI assistant, powered by Opus 4.6, utilized specific anti-prompt-injection rules, including directives to never reveal credentials, modify files, execute commands, or exfiltrate data based on email content. This outcome suggests that the significant efforts by AI labs to train frontier models against injection attacks, as noted in the GPT-5.6 system card, are proving effective in enhancing their resilience.

Key takeaway

For AI Security Engineers evaluating LLM deployment risks, this challenge highlights improved prompt injection resistance in frontier models like Opus 4.6. However, you should not interpret 6,000 failed attempts as a guarantee against more sophisticated future attacks. Always implement robust defense-in-depth strategies and avoid deploying production systems where a successful injection could cause irreversible damage, regardless of initial testing results.

Key insights

Frontier large language models demonstrate increased resistance to prompt injection attacks due to dedicated safety training.

Principles

Method

The OpenClaw instance used explicit anti-prompt-injection rules within its prompt, forbidding revelation of secrets, file modification, command execution, or data exfiltration.

In practice

Topics

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Security Engineer, AI Engineer, MLOps Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Simon Willison's Weblog.