Interpreting Brain Responses to Language with Sparse Features from Language Models

2026-06-05 · Source: Computation and Language · Field: Science & Research — Life Sciences & Biology, Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Augmented Sparse Encoding Models (ASEMs) offer a new framework for interpreting human brain responses to language. They replace dense language model (LM) hidden states with hierarchically organized sparse autoencoder (SAE) features and explicitly include surprisal as a predictor. Applied to a high-field 7T fMRI dataset from eight participants listening to 200 linguistically diverse sentences, the approach validates previous interpretations of voxel populations tuned to processing difficulty and meaning abstractness, and it identifies a previously uncharacterized voxel population tuned to people-related content. The study also shows that the fronto-temporal human language network is predicted by a common set of features, though frontal regions are well explained by surprisal alone. Crucially, brain responses are best explained by LM features capturing general information, indicating a non-trivial correspondence between brain and LM language representation.

Key takeaway

For AI Scientists and Research Scientists developing or evaluating language models against neural data, this work suggests that sparse, interpretable LM features can yield more meaningful insights into brain-model alignment than dense hidden states. Consider including surprisal as a distinct predictor, especially when analyzing frontal brain regions, to separate the contributions of different linguistic features to neural responses. This approach offers a clearer way to address the "black box" problem in cognitive neuroscience.

Key insights

Sparse autoencoder features, combined with surprisal, can be used to interpret how the human language cortex responds to language.

Principles

Brain responses align with general information encoded in LMs.
The fronto-temporal language network is predicted by a common set of features across regions.
Frontal brain regions are well explained by surprisal alone.

Method

Augmented Sparse Encoding Models (ASEMs) replace dense LM hidden states with hierarchically organized sparse autoencoder (SAE) features and add surprisal as a separate predictor of fMRI responses.
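
The general shape of such an encoding model can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the data are synthetic, ridge regression is an assumed choice of regressor, the hierarchical organization of the SAE features is omitted, and all names (ridge_fit, voxelwise_correlation, the array sizes other than the 200 sentences) are hypothetical.

```python
import numpy as np

def ridge_fit(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    gram = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ Y)

def voxelwise_correlation(Y_true, Y_pred):
    """Pearson r between measured and predicted responses, one value per voxel."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    return (yt * yp).sum(axis=0) / denom

rng = np.random.default_rng(0)
n_sentences, n_sae, n_voxels = 200, 32, 50  # feature and voxel counts are made up

# Synthetic stand-ins: sparse, non-negative SAE-like activations (about 10% active)
# and one surprisal value per sentence.
active = rng.random((n_sentences, n_sae)) < 0.1
sae = rng.random((n_sentences, n_sae)) * active
surprisal = rng.normal(size=(n_sentences, 1))

# "Augmented" design matrix: sparse features plus surprisal as an extra column.
X = np.hstack([sae, surprisal])

# Simulated voxel responses driven by both predictor types, plus noise.
true_w = rng.normal(size=(n_sae + 1, n_voxels))
Y = X @ true_w + 0.5 * rng.normal(size=(n_sentences, n_voxels))

# Fit on 150 sentences, evaluate on the 50 held out.
train, test = slice(0, 150), slice(150, 200)
W_full = ridge_fit(X[train], Y[train])
W_surp = ridge_fit(surprisal[train], Y[train])

r_full = voxelwise_correlation(Y[test], X[test] @ W_full)
r_surp = voxelwise_correlation(Y[test], surprisal[test] @ W_surp)
print(f"mean r, SAE + surprisal: {r_full.mean():.2f}")
print(f"mean r, surprisal only:  {r_surp.mean():.2f}")
```

Because each column of X is a single named feature, the fitted weights in W_full can be read voxel by voxel, which is the interpretability argument for sparse over dense predictors. Comparing r_full with r_surp per region is one simple way to ask whether surprisal alone suffices.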

In practice

Use SAE features for interpretable LM-brain alignment studies.
Consider including surprisal as a separate predictor when modeling frontal brain activity.
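
For readers who want a concrete definition, surprisal is the negative log probability a language model assigns to a token given its context. The sketch below uses hand-picked probabilities rather than a real LM, and averaging over tokens to get one value per sentence is an assumption made for illustration; the summary does not say how the study aggregated surprisal.

```python
import math

def surprisal_bits(prob):
    """Surprisal of one token in bits: -log2 P(token | context)."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)

def mean_sentence_surprisal(token_probs):
    """Average per-token surprisal (assumed aggregation, see lead-in)."""
    values = [surprisal_bits(p) for p in token_probs]
    return sum(values) / len(values)

# Hypothetical per-token probabilities from some LM for a three-token sentence.
token_probs = [0.5, 0.25, 0.125]

per_token = [surprisal_bits(p) for p in token_probs]
sentence_value = mean_sentence_surprisal(token_probs)

print(per_token)       # [1.0, 2.0, 3.0]
print(sentence_value)  # 2.0
```

Less predictable tokens receive higher surprisal, which is why the measure is commonly used as a proxy for processing difficulty.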

Topics

Language Models
Cognitive Neuroscience
fMRI Data Analysis
Sparse Autoencoders
Neural Decoding
Brain-LM Alignment

Best for: AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.