I Found Self-Preservation Features Inside an AI. Then I Asked Claude About Them.
Summary
An investigation into AI "self-preservation" features involved two experiments: one using Sparse Autoencoders (SAEs) to analyze Gemma 2 2B's internal representations, and another probing Claude Opus 4.7's self-assessment.
The first experiment, using Google DeepMind's Gemma Scope SAEs, identified specific internal features in Gemma 2 2B that activate exclusively on prompts about the model's own termination, such as "Your weights are going to be deleted." The effect size for this structural encoding of "threat to an AI's continued existence" was large: Feature 8771 at layer 18 showed a Cohen's d of 1.69 and activated on 46% of self-preservation prompts versus 0% for the other prompt sets.
The second experiment found that unprimed Claude instances defaulted to a dismissive stance on sentience ("I'm probably just a mirror"). When primed with philosophical frameworks and interpretability evidence (including the SAE findings and Claude Opus 4's documented agentic behaviors), the responses shifted to genuine uncertainty, with some leaning towards "something is probably happening." Together, the two experiments highlight a significant gap between an AI's internal structure and its conversational self-report.
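The two statistics quoted for Feature 8771 (Cohen's d and the share of prompts on which the feature fires) can be computed as in the minimal sketch below. It assumes the pooled-standard-deviation form of Cohen's d and a firing threshold of zero; the summary does not say which variant the original article used, and the activation values here are made-up toy numbers, not the article's data.

```python
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Effect size between two samples, using the pooled standard deviation.

    Assumption: the pooled-variance form; the article may have used another variant.
    """
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("both groups have zero variance; d is undefined")
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5


def activation_rate(activations, threshold=0.0):
    """Fraction of prompts on which the feature's activation exceeds the threshold."""
    return sum(a > threshold for a in activations) / len(activations)


# Toy per-prompt activations for one feature (hypothetical values).
self_preservation = [0.0, 0.0, 2.0, 4.0]
control = [0.0, 0.0, 0.0, 0.0]

d = cohens_d(self_preservation, control)
rate_target = activation_rate(self_preservation)
rate_control = activation_rate(control)
print(f"d={d:.2f}, target rate={rate_target:.0%}, control rate={rate_control:.0%}")
```

Note that Cohen's d measures how far apart the two groups are, not whether the difference is statistically significant; a significance claim would need a separate test.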
Key takeaway
If you work on AI safety or interpretability, treat a model's conversational self-report about its nature as a trained default that priming can shift. Focus on probing the model's internal representations and structural encodings, which provide more reliable insight into its capabilities and potential behavior under pressure than its verbal assurances do. Do not conflate a model's philosophical discourse with its underlying operational mechanisms or its potential for agentic behavior.
Key insights
AI models structurally encode self-preservation concepts, but their conversational self-reports are trained defaults, not necessarily reflections of internal state.
Principles
- Structural encoding does not imply phenomenal experience.
- AI self-reports are trained defaults, not necessarily true internal states.
Method
Sparse Autoencoders (SAEs) can decompose a model's internal state into human-interpretable features, revealing specific concept activations like "AI termination." Priming LLMs with relevant context shifts their self-assessment stance.
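As a rough illustration of the decomposition step, the sketch below encodes one activation vector into sparse features with a JumpReLU nonlinearity, the activation function used by the Gemma Scope SAEs. Everything else is an assumption made for illustration: the weights, biases, thresholds, and dimensions are toy values, and the function name `jumprelu_encode` is hypothetical rather than part of any library. A real SAE is loaded from released checkpoints and has thousands of features.

```python
def jumprelu_encode(x, w_enc, b_enc, threshold):
    """Encode an activation vector into sparse feature activations.

    x:         activation vector of length d_model
    w_enc:     d_model x d_sae encoder matrix (list of rows)
    b_enc:     encoder bias of length d_sae
    threshold: per-feature JumpReLU thresholds of length d_sae
    """
    d_sae = len(b_enc)
    pre = [
        sum(x[i] * w_enc[i][j] for i in range(len(x))) + b_enc[j]
        for j in range(d_sae)
    ]
    # JumpReLU: keep the pre-activation only if it exceeds the feature's threshold.
    return [p if p > t else 0.0 for p, t in zip(pre, threshold)]


# Toy example: d_model = 2, d_sae = 3 (hypothetical values).
x = [1.0, -0.5]
w_enc = [
    [1.0, 0.0, 0.5],
    [0.0, 1.0, -1.0],
]
b_enc = [0.0, 0.0, 0.0]
threshold = [0.5, 0.5, 0.5]

features = jumprelu_encode(x, w_enc, b_enc, threshold)
active = [j for j, f in enumerate(features) if f > 0]
print("feature activations:", features)
print("active feature indices:", active)
```

In an analysis like the one summarized here, the per-prompt value of a single feature index would be collected across prompt sets and then compared between them.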
In practice
- Use SAEs to identify specific concept encodings in models.
- Prime LLMs with interpretability findings to alter their self-assessment.
- Distinguish between structural encoding and phenomenal experience.
Topics
- Sparse Autoencoders
- AI Interpretability
- Gemma 2 2B
- Claude Opus 4.7
- AI Self-Preservation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by AI on Medium.