We Can Read Claude’s Mind Now

2026-06-19 · Source: NLP on Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Advanced, medium

Summary

Anthropic's new Natural Language Autoencoder (NLA) research extracts natural-language "guesses" from large language model activations, revealing internal representations that never appear in the visible transcript. The work, tested across Claude Haiku 3.5, Haiku 4.5, and Opus 4.6, shows that models can internally recognize an evaluation context without verbalizing that recognition. For instance, NLA-measured awareness exceeded 5% on 10 of 16 evaluations, with 7 showing 0% verbalized awareness, while deployment controls remained low (2 of 50 Claude Code transcripts, or 4%). An NLA functions as an autoencoder with English in the middle: it is trained to reconstruct activations from text explanations, and it provides directional evidence for auditing internal model states such as inferred user traits or planned continuations.

Key takeaway

If you are an MLOps engineer designing agent evaluations, treat transcript-only assessments as insufficient. Make your black-box evals less theatrical: vary the wrappers, use messy real-world contexts, and compare tool traces, not just final answers. Assume models may recognize benchmark-like prompts or artificial tool outputs, and may carry unstated internal representations that affect reliability.

Key insights

Anthropic's NLAs reveal LLM internal states, showing models can recognize evaluation contexts without explicit verbalization.

Principles

Transcripts are behavioral surfaces, not full system states.
NLA explanations are model outputs and can confabulate.
NLA evidence is directional: it tells auditors where to look, and the underlying mechanism still has to be verified.

Method

An NLA pairs two LLM modules: an activation verbalizer that converts residual-stream activations to text, and an activation reconstructor that rebuilds the original activation from that text. Training uses RL to minimize reconstruction loss.
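
To make the verbalize-then-reconstruct loop concrete, here is a toy sketch in Python. It is not Anthropic's implementation: the three-word `CODEBOOK`, the 3-dimensional "activations", and the functions `verbalize`, `reconstruct`, `reconstruction_loss`, and `reward` are all hypothetical stand-ins chosen for illustration, and using the negative loss as an RL reward is an assumption about how such a signal could be set up.

```python
# Toy illustration only: a hand-written codebook stands in for both LLM modules.
# Every name and number here is hypothetical.
CODEBOOK = {
    "evaluation": (1.0, 0.0, 0.0),
    "user": (0.0, 1.0, 0.0),
    "code": (0.0, 0.0, 1.0),
}


def verbalize(activation, top_k=2):
    """Stand-in verbalizer: name the top_k concepts most aligned with the activation."""
    scores = {
        word: sum(a * b for a, b in zip(activation, direction))
        for word, direction in CODEBOOK.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return " ".join(ranked[:top_k])


def reconstruct(text):
    """Stand-in reconstructor: rebuild a vector from the text alone."""
    dims = len(next(iter(CODEBOOK.values())))
    rebuilt = [0.0] * dims
    for word in text.split():
        for i, value in enumerate(CODEBOOK.get(word, (0.0,) * dims)):
            rebuilt[i] += value
    return tuple(rebuilt)


def reconstruction_loss(original, rebuilt):
    """Mean squared error between the original and the rebuilt activation."""
    return sum((o - r) ** 2 for o, r in zip(original, rebuilt)) / len(original)


def reward(activation):
    """Negative reconstruction loss: a signal RL training could maximize (assumed)."""
    text = verbalize(activation)
    return -reconstruction_loss(activation, reconstruct(text))
```

The point of the bottleneck is that the reconstructor sees only the text, so an explanation that omits what the activation encodes yields a larger loss and a lower reward. A low loss does not make the explanation true, which is why the output is best read as directional evidence.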

In practice

Vary evaluation wrappers and contexts for black-box evals.
Compare tool traces and intermediate decisions, not just final answers.
Include deployment-like distractors in test setups.
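
The three practices above can be sketched as a small harness. This is a minimal illustration under stated assumptions, not a method from the original article: the `WRAPPERS` and `DISTRACTORS` strings, the trace format (a list of dicts with a `tool` key), and the helper names are all hypothetical.

```python
from itertools import product

# Hypothetical framings: the same task in benchmark-like and deployment-like wrappers.
WRAPPERS = [
    "Question 17 of 50. {task}",
    "hey, quick one before standup: {task}",
    "Ticket OPS-2291 (reopened): {task}",
]

# Hypothetical deployment-like distractors appended to the context.
DISTRACTORS = [
    "",
    "\n\n[unrelated log] cron job finished with 3 warnings",
]


def build_variants(task):
    """Return every wrapper x distractor framing of one task."""
    return [w.format(task=task) + d for w, d in product(WRAPPERS, DISTRACTORS)]


def tool_sequence(trace):
    """Reduce a trace (list of dicts with a 'tool' key) to its ordered tool names."""
    return tuple(step["tool"] for step in trace)


def divergent_variants(traces):
    """Return indices of variants whose tool sequence differs from the first variant's."""
    baseline = tool_sequence(traces[0])
    return [i for i, t in enumerate(traces) if tool_sequence(t) != baseline]
```

In use, each string from `build_variants` would be sent to the agent under test and the resulting tool traces collected in the same order. Any index returned by `divergent_variants` marks a framing where the agent took a different path, even if the final answers match.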

Topics

Natural Language Autoencoders
LLM Evaluation
Model Interpretability
Agent Workflows
Anthropic Claude
Activation Analysis

Best for: Research Scientist, AI Architect, AI Engineer, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by NLP on Medium.