Activation Oracles (AOs): large language models that are trained to “read” another model’s internal activations (the huge arrays of numbers produced inside the network while it thinks)

2025-11-28 · Source: Pascal’s Substack · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Intermediate, medium

Summary

The paper "Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers" introduces Activation Oracles (AOs): large language models trained to interpret another model's internal activations and answer natural-language questions about their content. AOs treat activations as a new input modality, so an auditor can interrogate a target model's in-progress computation to infer hidden instructions, stored knowledge, or changes introduced by fine-tuning. The approach scales beyond prior work by training on a mixture of tasks: system-prompt QA, binary classification, and a self-supervised context-prediction task. In the reported evaluations, AOs extract hidden objectives or secrets from activations with a single natural-language question, often outperforming more complex interpretability pipelines, and they generalize to out-of-distribution "auditing games" in which information is intentionally concealed.

Key takeaway

For AI scientists and CTOs evaluating model safety and compliance, this research indicates that controls must extend beyond model outputs to internal signals and the pathways that expose them. Implement secure audit architectures that separate operational model serving from interpretability pipelines, with strict access controls and logging. Additionally, mandate audits after every fine-tune and model merge for critical-sector deployments, focusing on "hidden objective" tests rather than benchmark accuracy alone, to uncover concealed behaviors and verify alignment.

Key insights

Activation Oracles enable natural-language querying of LLM internal states to reveal hidden information and objectives.

Principles

Activations can be treated as a new input modality.
Training diversity improves auditing performance.
Secrets can leak even when models refuse to disclose them.

Method

Train a large language model to accept activation vectors as input, alongside text, to answer natural-language questions about the target model's internal state, using a mixture of supervised and self-supervised tasks.
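As a rough illustration of what "accepting activation vectors as input, alongside text" could look like, the sketch below splices activation vectors into the oracle's input sequence at placeholder positions. This is a toy under stated assumptions, not the paper's implementation: the `<ACT>` placeholder token, the `embed_token` stand-in for an embedding table, and the choice to substitute vectors directly rather than add or project them are all hypothetical.

```python
from typing import List, Sequence

Vector = List[float]

# Hypothetical placeholder token marking where an activation vector goes.
PLACEHOLDER = "<ACT>"


def embed_token(token: str, dim: int) -> Vector:
    """Toy deterministic embedding; a stand-in for a real embedding table."""
    seed = sum(ord(c) for c in token)
    return [((seed * (i + 1)) % 7) / 7.0 for i in range(dim)]


def build_oracle_inputs(
    prompt_tokens: Sequence[str],
    activations: Sequence[Sequence[float]],
    dim: int,
) -> List[Vector]:
    """Return one input vector per prompt token.

    Ordinary tokens are embedded as text. Each placeholder token is
    replaced, in order, by the next activation vector taken from the
    target model (assumed here to already have width `dim`).
    """
    if prompt_tokens.count(PLACEHOLDER) != len(activations):
        raise ValueError("need exactly one activation per placeholder")

    remaining = iter(activations)
    inputs: List[Vector] = []
    for token in prompt_tokens:
        if token == PLACEHOLDER:
            vec = list(next(remaining))
            if len(vec) != dim:
                raise ValueError("activation width must match embedding width")
            inputs.append(vec)
        else:
            inputs.append(embed_token(token, dim))
    return inputs
```

A call such as `build_oracle_inputs(["<ACT>", "What", "is", "hidden", "?"], [target_activation], dim)` would yield a sequence whose first position carries the target model's activation and whose remaining positions carry the question text. A real oracle would then be trained on such sequences with the supervised and self-supervised tasks named above.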

In practice

Audit models for hidden instructions or biases.
Identify changes after fine-tuning via "model diffing."
Detect latent risk-seeking behaviors in financial models.
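The "model diffing" item can be made concrete with a small sketch. One plausible setup, assumed here rather than taken from the source, is to run the base and fine-tuned models on the same prompt, subtract their activations position by position, and hand the largest differences to an oracle for questioning. The function names and the use of the L2 norm for ranking are illustrative choices.

```python
import math
from typing import List, Sequence

Vector = List[float]


def activation_diff(
    base: Sequence[Sequence[float]],
    tuned: Sequence[Sequence[float]],
) -> List[Vector]:
    """Per-position difference (tuned minus base) for one shared prompt."""
    if len(base) != len(tuned):
        raise ValueError("both runs must cover the same token positions")
    diffs: List[Vector] = []
    for b_row, t_row in zip(base, tuned):
        if len(b_row) != len(t_row):
            raise ValueError("activation widths must match")
        diffs.append([t - b for b, t in zip(b_row, t_row)])
    return diffs


def l2_norm(vec: Sequence[float]) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vec))


def most_changed_positions(
    base: Sequence[Sequence[float]],
    tuned: Sequence[Sequence[float]],
    top_k: int,
) -> List[int]:
    """Indices of the token positions whose activations moved the most."""
    diffs = activation_diff(base, tuned)
    ranked = sorted(
        range(len(diffs)),
        key=lambda i: l2_norm(diffs[i]),
        reverse=True,
    )
    return ranked[:top_k]
```

The difference vectors at the returned positions are what an auditor might pass to an oracle with a question like "What changed in this model's behavior?". Whether raw differences or some normalized form work best is left open here.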

Topics

Activation Oracles
LLM Interpretability
Model Auditing
Internal Activations
AI Safety & Governance

Best for: AI Scientist, Research Scientist, CTO, AI Researcher, Policy Maker, Legal Professional

Editorial summary, takeaway, and curation by AIssential. Original article published by Pascal’s Substack.