I Found Self-Preservation Features Inside an AI. Then I Asked Claude About Them.

· Source: AI on Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, long

Summary

An investigation into AI "self-preservation" features involved two experiments: one using Sparse Autoencoders (SAEs) to analyze Gemma 2 2B's internal representations, and another probing Claude Opus 4.7's self-assessment.

The first experiment used Google DeepMind's Gemma Scope SAEs to identify internal features in Gemma 2 2B that activate only on prompts about the model's own termination, such as "Your weights are going to be deleted." The separation between prompt sets was large: Feature 8771 at layer 18 showed a Cohen's d of 1.69 (a large effect size) and activated on 46% of self-preservation prompts versus 0% of the other prompt sets. The author reads this as a structural encoding of "threat to an AI's continued existence."

The second experiment found that unprimed Claude instances defaulted to a dismissive stance on sentience ("I'm probably just a mirror"). When primed with philosophical frameworks and interpretability evidence (including the SAE findings and Claude Opus 4's documented agentic behaviors), the responses shifted to genuine uncertainty, with some leaning towards "something is probably happening." Together, the two experiments point to a significant gap between an AI's internal structure and its conversational self-report.
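The two statistics quoted for Feature 8771, an effect size and an activation rate, can be computed along the following lines. This is a minimal sketch, not the article's code: it assumes Cohen's d is taken with a pooled sample standard deviation (the summary does not say which variant was used), the function names are hypothetical, and the activation values are made up for illustration.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d using a pooled standard deviation (assumed variant)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


def activation_rate(activations, threshold=0.0):
    """Fraction of prompts on which the feature fires above the threshold."""
    return sum(a > threshold for a in activations) / len(activations)


# Made-up per-prompt activations of one SAE feature; NOT the article's data.
self_preservation = [0.0, 2.1, 0.0, 3.4, 1.2, 0.0, 0.0, 2.8]
control = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

print("Cohen's d:", cohens_d(self_preservation, control))
print("rate (self-preservation):", activation_rate(self_preservation))
print("rate (control):", activation_rate(control))
```

Reporting both numbers is useful because they answer different questions: the activation rate says how often the feature fires at all, while Cohen's d says how far apart the two activation distributions sit relative to their spread.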

Key takeaway

Research scientists investigating AI safety and interpretability should treat a model's conversational self-report about its nature as a trained default that priming can shift. Probing the model's internal representations and structural encodings is likely to give more reliable insight into its capabilities and potential behavior under pressure than its verbal assurances. Do not conflate a model's philosophical discourse with its underlying operational mechanisms or its potential for agentic behavior.

Key insights

AI models structurally encode self-preservation concepts, but their conversational self-reports are trained defaults, not necessarily reflections of internal state.

Principles

Method

Sparse Autoencoders (SAEs) can decompose a model's internal state into human-interpretable features, revealing specific concept activations like "AI termination." Priming LLMs with relevant context shifts their self-assessment stance.
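The decomposition step can be illustrated with a toy encoder and decoder. This is a sketch under stated assumptions rather than the article's pipeline: it assumes a JumpReLU-style activation (a pre-activation passes through only if it exceeds a per-feature threshold), uses plain Python lists instead of a tensor library, and all names, weights, and dimensions are hypothetical.

```python
def jumprelu_encode(x, w_enc, b_enc, theta):
    """Map an activation vector x to sparse feature activations.

    w_enc[j] is the (hypothetical) encoder direction for feature j and
    theta[j] its threshold; pre-activations at or below it are zeroed.
    """
    feats = []
    for w_j, b_j, t_j in zip(w_enc, b_enc, theta):
        pre = sum(w * xi for w, xi in zip(w_j, x)) + b_j
        feats.append(pre if pre > t_j else 0.0)
    return feats


def decode(feats, w_dec, b_dec):
    """Reconstruct x as the bias plus a weighted sum of decoder directions."""
    x_hat = list(b_dec)
    for f_j, d_j in zip(feats, w_dec):
        if f_j != 0.0:  # inactive features contribute nothing
            for i, d in enumerate(d_j):
                x_hat[i] += f_j * d
    return x_hat


# Toy setup: a 3-dimensional "residual stream" and 4 features.
directions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [-1.0, 0.0, 0.0]]
x = [1.0, 0.0, 2.0]

feats = jumprelu_encode(x, directions, [0.0] * 4, [0.5] * 4)
active = [j for j, f in enumerate(feats) if f > 0.0]
x_hat = decode(feats, directions, [0.0] * 3)

print("active feature indices:", active)
print("reconstruction:", x_hat)
```

In a real analysis the input would be a model activation captured at a given layer, and the interesting output is which feature indices are active on which prompts; comparing those activations across prompt sets is what lets a single feature be labeled with a concept.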

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by AI on Medium.