Interpreting Brain Responses to Language with Sparse Features from Language Models

2026-06-08 · Source: cs.CL updates on arXiv.org · Field: Science & Research — Life Sciences & Biology, Mathematics & Computational Sciences, Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

Augmented Sparse Encoding Models are an fMRI encoding framework that interprets human brain responses to language by replacing dense language model (LM) hidden states with hierarchically organized sparse autoencoder (SAE) features and adding surprisal as an explicit predictor. Applied to a high-field 7T fMRI dataset from eight participants listening to 200 linguistically diverse sentences, the framework validated previous interpretations of voxel populations tuned to processing difficulty and meaning abstractness. It also identified a previously uncharacterized voxel population tuned to people-related content. The study found that the fronto-temporal human language network is predicted by a common set of features across its regions: frontal areas are explained well by surprisal alone, while temporal regions draw more on LM-derived content features. Crucially, brain responses are best explained by general, primary LM features rather than idiosyncratic ones, suggesting a non-trivial correspondence between biological and artificial language representations.

Key takeaway

For AI Scientists and Research Scientists developing interpretable neural models, this work suggests focusing on sparse, hierarchically organized features like those from Matryoshka SAEs. Your efforts to align artificial and biological language representations should prioritize general, widely-applicable LM features, as these best explain human brain responses. This approach can yield clearer insights into neural population tuning and the underlying mechanisms of language processing.

Key insights

Augmented Sparse Encoding Models reveal that interpretable, general LM features align with human brain language processing.

Principles

Sparse autoencoder features can predict neural responses as accurately as dense LM features.
Brain alignment with LMs relies on general, widely-applicable LM features.
Processing difficulty and content drive distinct voxel populations in the brain.

Method

The method replaces dense LM hidden states with sparse, hierarchically organized SAE features and adds surprisal as an extra predictor. LASSO regression first selects features, and Ridge regression then fits the predictive model on the selected features.
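The two-stage fit described above can be sketched in a few lines. This is a minimal pure-Python illustration on synthetic data, not the authors' implementation: the simulated "SAE" activations, the simulated surprisal values, the voxel response, the 150/50 train/test split, and the hyperparameters `alpha` and `lam` are all assumptions made for the example. Only the 200-sentence count comes from the study. A real pipeline would more likely use a library such as scikit-learn's `Lasso` and `Ridge` and tune the penalties by cross-validation.

```python
import math
import random

def standardize(columns):
    """Z-score each column; constant columns become all zeros."""
    out = []
    for col in columns:
        mean = sum(col) / len(col)
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
        out.append([(v - mean) / sd if sd > 0 else 0.0 for v in col])
    return out

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the LASSO proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_select(columns, y, alpha, n_iter=200):
    """Coordinate-descent LASSO; returns indices of columns with nonzero weight."""
    n = len(y)
    weights = [0.0] * len(columns)
    residual = list(y)
    for _ in range(n_iter):
        for j, col in enumerate(columns):
            scale = sum(c * c for c in col) / n
            if scale == 0.0:
                continue
            # Covariance of column j with the residual, with j's own contribution added back.
            rho = sum(c * r for c, r in zip(col, residual)) / n + weights[j] * scale
            new_w = soft_threshold(rho, alpha) / scale
            delta = new_w - weights[j]
            if delta != 0.0:
                residual = [r - delta * c for r, c in zip(residual, col)]
                weights[j] = new_w
    return [j for j, w in enumerate(weights) if w != 0.0]

def ridge_fit(columns, y, lam):
    """Solve (X^T X + lam I) w = X^T y by Gauss-Jordan elimination."""
    k = len(columns)
    a = [[sum(p * q for p, q in zip(columns[i], columns[j])) + (lam if i == j else 0.0)
          for j in range(k)] + [sum(p * t for p, t in zip(columns[i], y))]
         for i in range(k)]
    for i in range(k):
        pivot = max(range(i, k), key=lambda r: abs(a[r][i]))
        a[i], a[pivot] = a[pivot], a[i]
        for r in range(k):
            if r != i:
                factor = a[r][i] / a[i][i]
                a[r] = [vr - factor * vi for vr, vi in zip(a[r], a[i])]
    return [a[i][k] / a[i][i] for i in range(k)]

def pearson(u, v):
    """Pearson correlation between two equal-length lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((p - mu) * (q - mv) for p, q in zip(u, v))
    return cov / math.sqrt(sum((p - mu) ** 2 for p in u) * sum((q - mv) ** 2 for q in v))

# Synthetic stand-in data: every value below is an illustrative assumption.
rng = random.Random(0)
n_sentences, n_sae_features, n_train = 200, 8, 150
sae = [[rng.random() if rng.random() < 0.3 else 0.0 for _ in range(n_sentences)]
       for _ in range(n_sae_features)]  # mostly-zero activations, one list per feature
surprisal = [-math.log2(rng.uniform(0.01, 1.0)) for _ in range(n_sentences)]
voxel = [2.0 * sae[1][i] - 1.5 * sae[5][i] + 0.5 * surprisal[i] + rng.gauss(0.0, 0.05)
         for i in range(n_sentences)]  # simulated response driven by features 1, 5 and surprisal

design = standardize(sae + [surprisal])  # surprisal is appended as the last column (index 8)
train = [col[:n_train] for col in design]
test = [col[n_train:] for col in design]
y_mean = sum(voxel[:n_train]) / n_train
y_train = [v - y_mean for v in voxel[:n_train]]

selected = lasso_select(train, y_train, alpha=0.05)                  # stage 1: feature selection
weights = ridge_fit([train[j] for j in selected], y_train, lam=1.0)  # stage 2: prediction model
predicted = [y_mean + sum(w * test[j][i] for w, j in zip(weights, selected))
             for i in range(n_sentences - n_train)]
held_out_r = pearson(predicted, voxel[n_train:])
print("selected columns:", selected, "held-out r: %.3f" % held_out_r)
```

Keeping surprisal as its own column lets the selection stage pick it independently of the content features, which is what makes it possible to ask whether a voxel is explained by processing difficulty alone.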

In practice

Use Matryoshka SAEs for more interpretable features in LM encoding models.
Include surprisal as an explicit predictor to dissociate processing difficulty from content.
Analyze feature prevalence to identify shared and individual-specific neural representations.
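As a rough illustration of the prevalence analysis in the last point, the sketch below counts how many participants' models select each feature. It assumes that "prevalence" means the fraction of participants for whom a feature is selected; the paper may define it differently. The participant labels, feature IDs, and the 0.75 threshold for "shared" are made up for the example.

```python
from collections import Counter

# Hypothetical selected-feature sets per participant (made-up IDs, not study data).
selected_by_participant = {
    "P1": {"surprisal", "f12", "f40"},
    "P2": {"surprisal", "f12"},
    "P3": {"surprisal", "f12", "f77"},
    "P4": {"surprisal", "f40"},
}

def feature_counts(selections):
    """Count, for each feature, how many participants selected it."""
    return Counter(feature for features in selections.values() for feature in features)

counts = feature_counts(selected_by_participant)
n_participants = len(selected_by_participant)

# Prevalence: fraction of participants whose model selected the feature.
prevalence = {feature: count / n_participants for feature, count in counts.items()}

# Assumed cut-offs: "shared" if selected by at least 75% of participants,
# "individual-specific" if selected by exactly one participant.
shared = sorted(f for f, p in prevalence.items() if p >= 0.75)
individual_specific = sorted(f for f, c in counts.items() if c == 1)

print("shared:", shared)
print("individual-specific:", individual_specific)
```

The same counting could be applied across voxels or regions instead of participants, depending on which level of sharing is of interest.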

Topics

Cognitive Neuroscience
Language Models
Sparse Autoencoders
fMRI Encoding Models
Neural Interpretability
Brain-LM Alignment

Code references

mlepori1/Interpretable_Encoding_Models

Best for: AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.