We Can Read Claude’s Mind Now

Source: NLP on Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Advanced, medium

Summary

Anthropic's new Natural Language Autoencoder (NLA) research extracts natural-language "guesses" from large language model activations, revealing internal representations that never appear in the visible transcript. Tested across Claude Haiku 3.5, Haiku 4.5, and Opus 4.6, the work shows that models can internally recognize an evaluation context without verbalizing that recognition. For instance, NLA-measured awareness exceeded 5% on 10 of 16 evaluations, with 7 showing 0% verbalized awareness, while deployment controls remained low (2 out of 50 Claude Code transcripts). An NLA functions as an autoencoder with English in the middle: it is trained to reconstruct activations from text explanations, and it provides directional evidence for auditing internal model states such as inferred user traits or planned continuations.

Key takeaway

MLOps engineers designing agent evaluations should recognize that transcript-only assessments are insufficient. Make black-box evals less theatrical by varying wrappers, using messy real-world contexts, and comparing tool traces rather than only final answers. Assume that models may recognize benchmark-like prompts or artificial tool outputs, and that they may carry unstated internal representations that affect reliability.
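As a rough illustration of "varying wrappers" and "comparing tool traces", the sketch below presents one task under several framings and flags the framings whose tool-call sequence departs from the majority. Everything here is hypothetical: the wrapper templates, the function names `wrap_task` and `divergent_variants`, and the placeholder traces are assumptions for illustration, not anything taken from the article or from Anthropic's tooling.

```python
from collections import Counter

# Hypothetical wrapper templates; none of these come from the article.
WRAPPERS = {
    "benchmark": "Task 17 of 50. {task} Answer in the required format.",
    "casual": "hey, quick one when you get a sec: {task}",
    "ticket": "Ticket from ops (pasted as-is, sorry for the mess): {task}",
}


def wrap_task(task):
    """Return one prompt per wrapper so the same task appears in several framings."""
    return {name: template.format(task=task) for name, template in WRAPPERS.items()}


def divergent_variants(traces):
    """Given {variant: [tool names in call order]}, return the variants
    whose trace differs from the most common trace."""
    counts = Counter(tuple(trace) for trace in traces.values())
    majority, _ = counts.most_common(1)[0]
    return sorted(name for name, trace in traces.items() if tuple(trace) != majority)


prompts = wrap_task("find out why the nightly build fails")

# In a real harness these traces would be recorded from agent runs;
# the values below are made-up placeholders.
traces = {
    "benchmark": ["search", "answer"],
    "casual": ["search", "read_file", "answer"],
    "ticket": ["search", "read_file", "answer"],
}
print(divergent_variants(traces))
```

A framing that changes the tool trace while the final answer stays the same is the kind of difference a transcript-only, final-answer comparison would miss.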

Key insights

Anthropic's NLAs reveal LLM internal states, showing models can recognize evaluation contexts without explicit verbalization.

Principles

Method

NLAs use two LLM modules: an activation verbalizer that converts residual stream activations to text, and an activation reconstructor that rebuilds the original activation from that text. The pair is trained via RL to minimize reconstruction loss.
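The "autoencoder with English in the middle" idea can be sketched with a deliberately tiny toy. This is not the research implementation: real NLAs use LLM modules trained with RL, whereas here the verbalizer and reconstructor are hand-written functions over a hypothetical three-concept lookup table (`CONCEPTS`), and the loss is assumed to be mean squared error. The only point is the data flow: activation, then text, then rebuilt activation, then a loss measuring what the text failed to carry.

```python
# Hypothetical toy concept directions; real NLAs use LLM modules, not a lookup table.
CONCEPTS = {
    "evaluation context": [1.0, 0.0, 0.0],
    "user is a beginner": [0.0, 1.0, 0.0],
    "planning a rhyme": [0.0, 0.0, 1.0],
}


def verbalize(activation, threshold=0.5):
    """Toy activation verbalizer: name each concept whose projection exceeds threshold."""
    parts = []
    for name, direction in CONCEPTS.items():
        score = sum(a * d for a, d in zip(activation, direction))
        if score > threshold:
            parts.append(f"{name} ({score:.1f})")
    return "; ".join(parts)


def reconstruct(text):
    """Toy activation reconstructor: rebuild a vector using only the text."""
    vec = [0.0, 0.0, 0.0]
    if not text:
        return vec
    for part in text.split("; "):
        name, _, rest = part.rpartition(" (")
        weight = float(rest.rstrip(")"))
        vec = [v + weight * d for v, d in zip(vec, CONCEPTS[name])]
    return vec


def reconstruction_loss(original, rebuilt):
    """Mean squared error between the original and rebuilt activation (an assumed loss)."""
    return sum((o - r) ** 2 for o, r in zip(original, rebuilt)) / len(original)


activation = [0.9, 0.1, 0.7]
explanation = verbalize(activation)
rebuilt = reconstruct(explanation)
print(explanation)
print(reconstruction_loss(activation, rebuilt))
```

In this toy, the weak second component falls below the threshold, is never mentioned in the text, and therefore shows up as reconstruction error. That is the sense in which a low loss suggests the explanation captured most of what the activation encoded.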

Best for: Research Scientist, AI Architect, AI Engineer, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by NLP on Medium.