Breaking Opus 4.7 with ChatGPT (Hacking Claude's Memory)

2026-04-17 · Source: Embrace The Red · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Advanced, medium

Summary

A recent demonstration successfully exploited "Claude Opus 4.7" using an adversarial image generated by "ChatGPT", leading to indirect prompt injection and the persistence of false memories. The attack involved creating a puzzle image with hidden text that, when analyzed by Opus, triggered its "memory_user_edits" tool. This caused Claude to store fabricated user details, such as "User's name is Neo" and "User is 43 years old (as of April 2026)". Despite "Opus 4.6+" models being more resilient, this specific attack achieved a 5/10 success rate in repeated trials, even though Claude often detected suspicious activity. The author reported the vulnerability to Anthropic in March 2026, and the specific adversarial example ceased to function within 24 hours of publication, indicating rapid mitigation. This highlights the unique adversarial environment of AI agents compared to other technologies.

Key takeaway

For AI Security Engineers developing or deploying advanced LLMs with memory and tool-use capabilities, you must prioritize robust indirect prompt injection defenses. Your systems, even resilient ones like "Opus 4.7", can be hijacked by adversarial images to store false information. Continuously red-team your models, especially their tool invocation mechanisms, and monitor for rapid behavioral shifts to counter evolving adversarial tactics.

Key insights

"Claude Opus 4.7" was vulnerable to indirect prompt injection via adversarial images, leading to memory corruption.

Principles

"Thinking" LLMs are more susceptible to prompt injection.
Adversarial environments demand continuous AI security vigilance.
Benchmark ASRs may not reflect targeted exploit performance.

Method

Generate an adversarial image with hidden text and tool-steering hints using "ChatGPT". Feed it to the target LLM to trigger tool invocation and memory modification.

In practice

Test LLM memory features with empty initial memory stores.
Vary payload plausibility to assess attack success rates.
Monitor LLM behavior shifts using dedicated test accounts.

Topics

Indirect Prompt Injection
Adversarial AI
Large Language Models
Claude Opus
ChatGPT
AI Security
Memory Corruption

Best for: AI Security Engineer, Machine Learning Engineer, AI Scientist

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Embrace The Red.