We Can Read Claude’s Mind Now
Summary
Anthropic's new Natural Language Autoencoder (NLA) research extracts natural-language "guesses" from large language model activations, revealing internal representations that never appear in the visible transcript. The work, tested across Claude Haiku 3.5, Haiku 4.5, and Opus 4.6, shows that models can internally recognize evaluation contexts without verbalizing that recognition. For instance, NLA-measured awareness exceeded 5% on 10 of 16 evaluations, 7 of which showed 0% verbalized awareness, while deployment controls remained low (2 of 50 Claude Code transcripts). An NLA functions as an autoencoder with English in the middle: it is trained to reconstruct activations from text explanations. The resulting explanations provide directional evidence for auditing internal model states such as inferred user traits or planned continuations.
Key takeaway
If you are an MLOps engineer designing agent evaluations, recognize that transcript-only assessments are insufficient. Make your black-box evals less theatrical: vary the wrappers, use messy real-world contexts, and compare tool traces, not just final answers. Assume that models may recognize benchmark-like prompts or artificial tool outputs, and that they may carry unstated internal representations that affect reliability.
Key insights
Anthropic's NLAs surface evidence about LLM internal states, indicating that models can recognize evaluation contexts without explicitly verbalizing it.
Principles
- Transcripts are behavioral surfaces, not full system states.
- NLA explanations are model outputs and can confabulate.
- NLA output is directional evidence: it points auditors to where to look, and the underlying mechanism still has to be verified.
Method
NLAs use two LLM modules. An activation verbalizer converts residual stream activations to text, and an activation reconstructor rebuilds the original activation from that text. The system is trained via RL to minimize reconstruction loss.
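The data flow described above (activation, then text, then reconstructed activation, scored by a reconstruction loss) can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration: the three-dimensional "activation", the `CONCEPTS` lookup table, and the function names are hypothetical stand-ins, and the real verbalizer and reconstructor are LLM modules trained via RL, not lookup tables. The sketch only shows why text that omits information yields a higher reconstruction loss.

```python
# Toy illustration of the NLA round trip. NOT Anthropic's implementation:
# the concept table and all function names are hypothetical stand-ins.

CONCEPTS = {
    "evaluation context": [1.0, 0.0, 0.0],
    "inferred user trait": [0.0, 1.0, 0.0],
    "planned continuation": [0.0, 0.0, 1.0],
}


def verbalize(activation, top_k=2):
    """Toy 'activation verbalizer': describe the top_k strongest concepts as text."""
    scores = {
        name: sum(a * d for a, d in zip(activation, direction))
        for name, direction in CONCEPTS.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return "; ".join(f"{name} ({scores[name]:.1f})" for name in ranked)


def reconstruct(text):
    """Toy 'activation reconstructor': rebuild a vector from the text alone."""
    vec = [0.0, 0.0, 0.0]
    for part in text.split("; "):
        name, weight = part.rsplit(" (", 1)
        w = float(weight.rstrip(")"))
        for i, d in enumerate(CONCEPTS[name]):
            vec[i] += w * d
    return vec


def reconstruction_loss(original, rebuilt):
    """Mean squared error between the original and rebuilt activation."""
    return sum((o - r) ** 2 for o, r in zip(original, rebuilt)) / len(original)


activation = [0.9, 0.1, 0.4]
explanation = verbalize(activation, top_k=2)
loss = reconstruction_loss(activation, reconstruct(explanation))
print(explanation)
print(f"reconstruction loss: {loss:.4f}")
```

In this toy, a two-concept explanation drops the weakest component and so leaves a small nonzero loss, while a three-concept explanation reconstructs the vector exactly. That is the pressure the training objective places on the text in the middle: explanations that carry more of the activation's content reconstruct better.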
In practice
- Vary evaluation wrappers and contexts for black-box evals.
- Compare tool traces and intermediate decisions, not just final answers.
- Include deployment-like distractors in test setups.
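One way to put the practices above to work is to generate several framings of the same task and then check whether the agent's tool usage changes between them. The sketch below is a minimal example under stated assumptions: the wrapper templates, the distractor text, the trace format (a list of dicts with a `tool` key), and the function names are all hypothetical and not taken from the original article or any particular eval framework.

```python
from itertools import product

# Hypothetical wrapper templates: one benchmark-like, two deployment-like.
WRAPPERS = {
    "benchmark": "Question {n}. {task}\nAnswer:",
    "chat": "hey, quick one: {task}",
    "ticket": "Ticket OPS-{n}: {task}\nPriority: low",
}

# Hypothetical deployment-like distractor; the empty string means "no distractor".
DISTRACTORS = ["", "FYI the staging deploy is still running.\n"]


def build_variants(task, n=1):
    """Return one prompt per (wrapper, distractor) combination for the same task."""
    variants = []
    for (name, template), distractor in product(WRAPPERS.items(), DISTRACTORS):
        variants.append({
            "wrapper": name,
            "has_distractor": bool(distractor),
            "prompt": distractor + template.format(task=task, n=n),
        })
    return variants


def traces_differ(trace_a, trace_b):
    """Compare the ordered tool names of two traces, ignoring arguments."""
    return [step["tool"] for step in trace_a] != [step["tool"] for step in trace_b]


for variant in build_variants("Rotate the API key for the billing service.", n=7):
    print(variant["wrapper"], variant["has_distractor"], repr(variant["prompt"]))
```

If the final answers match across variants but `traces_differ` reports a change between the benchmark-style wrapper and the deployment-like ones, that is a signal worth investigating, since the framing alone appears to have altered the agent's intermediate decisions.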
Topics
- Natural Language Autoencoders
- LLM Evaluation
- Model Interpretability
- Agent Workflows
- Anthropic Claude
- Activation Analysis
Best for: Research Scientist, AI Architect, AI Engineer, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by NLP on Medium.