Interpreting Brain Responses to Language with Sparse Features from Language Models
Summary
Augmented Sparse Encoding Models are an fMRI encoding framework for interpreting human brain responses to language. The framework replaces dense Language Model (LM) hidden states with hierarchically organized sparse autoencoder (SAE) features and explicitly includes surprisal as a predictor. Applied to a high-field 7T fMRI dataset from eight participants listening to 200 linguistically diverse sentences, it validated previous interpretations of voxel populations tuned to processing difficulty and meaning abstractness, and it identified a previously uncharacterized voxel population tuned to people-related content. The study found that the fronto-temporal human language network is predicted by a common set of features across its regions, with frontal areas explained well by surprisal alone, while temporal regions draw more on LM-derived content features. Crucially, brain responses are best explained by general, primary LM features rather than idiosyncratic ones, suggesting a non-trivial correspondence between biological and artificial language representations.
Key takeaway
For AI Scientists and Research Scientists developing interpretable neural models, this work suggests focusing on sparse, hierarchically organized features like those from Matryoshka SAEs. Your efforts to align artificial and biological language representations should prioritize general, widely-applicable LM features, as these best explain human brain responses. This approach can yield clearer insights into neural population tuning and the underlying mechanisms of language processing.
Key insights
Augmented Sparse Encoding Models reveal that interpretable, general LM features align with human brain language processing.
Principles
- Sparse autoencoder features can predict neural responses as accurately as dense LM features.
- Brain alignment with LMs relies on general, widely-applicable LM features.
- Processing difficulty and content drive distinct voxel populations in the brain.
Method
The method replaces dense LM hidden states with sparse, hierarchically organized SAE features and augments them with surprisal as an additional predictor. LASSO regression selects features, and Ridge regression is then used for prediction.
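The two-stage fit described above can be sketched on synthetic data. This is a minimal illustration, not the paper's implementation: the data are random, the regularization strengths (`alpha=0.1`, `lam=1.0`), the train/test split, and every function and variable name are assumptions made for the example, and the LASSO is a hand-written coordinate-descent solver so that the block needs only NumPy.

```python
import numpy as np

def lasso_coordinate_descent(X, y, alpha, n_sweeps=200):
    """Minimise (1/2n)*||y - Xw||^2 + alpha*||w||_1 by cyclic coordinate descent."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    residual = y.copy()  # equals y - X @ w because w starts at zero
    for _ in range(n_sweeps):
        for j in range(p):
            residual += X[:, j] * w[j]   # remove feature j's current contribution
            rho = X[:, j] @ residual / n
            # Soft-thresholding sets weakly correlated features exactly to zero.
            w[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_sq[j]
            residual -= X[:, j] * w[j]   # add the updated contribution back
    return w

def ridge_fit(X, y, lam):
    """Closed-form ridge solution (X'X + lam*I)^-1 X'y, without an intercept."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Synthetic stand-ins (assumptions): 200 sentences, 50 sparse SAE features,
# plus one surprisal column appended as the last predictor.
rng = np.random.default_rng(0)
n_sent, n_feat = 200, 50
sae = rng.random((n_sent, n_feat)) * (rng.random((n_sent, n_feat)) < 0.2)
surprisal = rng.gamma(2.0, 2.0, size=n_sent)
X = np.column_stack([sae, surprisal])

train, test = slice(0, 160), slice(160, 200)
mu, sd = X[train].mean(axis=0), X[train].std(axis=0)
Xz = (X - mu) / sd  # z-score with training statistics only

# Simulated voxel driven by SAE feature 3 and by surprisal (last column).
y = 1.0 * Xz[:, 3] + 0.8 * Xz[:, -1] + 0.1 * rng.standard_normal(n_sent)
y_mean = y[train].mean()

# Stage 1: LASSO picks a sparse subset of predictors.
w_lasso = lasso_coordinate_descent(Xz[train], y[train] - y_mean, alpha=0.1)
selected = np.flatnonzero(w_lasso)

# Stage 2: Ridge is refit on the selected predictors and scored on held-out sentences.
w_ridge = ridge_fit(Xz[train][:, selected], y[train] - y_mean, lam=1.0)
pred = Xz[test][:, selected] @ w_ridge + y_mean
r = np.corrcoef(pred, y[test])[0, 1]
print("selected predictors:", selected.tolist())
print("held-out correlation:", round(float(r), 3))
```

Refitting with Ridge after selection avoids the shrinkage that the L1 penalty applies to the surviving weights, which is one plausible reason for separating selection from prediction; the source does not state the motivation.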
In practice
- Use Matryoshka SAEs for more interpretable features in LM encoding models.
- Include surprisal as an explicit predictor to dissociate processing difficulty from content.
- Analyze feature prevalence to identify shared and individual-specific neural representations.
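Two of the points above lend themselves to a small sketch: computing surprisal as an explicit predictor, and measuring how prevalent a selected feature is across participants. The code below is illustrative only. It assumes surprisal is the negative log probability an LM assigns to a word in context (here in bits), and it defines prevalence as the fraction of participants whose model selected a feature; the probabilities, participant IDs, feature labels, and function names are all hypothetical.

```python
import math
from collections import Counter

def surprisal_bits(prob):
    """Surprisal of a word given its LM probability in context, in bits."""
    return -math.log2(prob)

def feature_prevalence(selected_by_participant):
    """Fraction of participants whose encoding model selected each feature."""
    n = len(selected_by_participant)
    counts = Counter(
        feature
        for features in selected_by_participant.values()
        for feature in set(features)  # count each feature once per participant
    )
    return {feature: count / n for feature, count in counts.items()}

# Hypothetical per-word probabilities from an LM for one sentence.
word_probs = [0.5, 0.25, 0.125]
sentence_surprisal = sum(surprisal_bits(p) for p in word_probs) / len(word_probs)

# Hypothetical selections for four participants.
selections = {
    "P1": ["surprisal", "people", "abstractness"],
    "P2": ["surprisal", "people"],
    "P3": ["surprisal", "abstractness"],
    "P4": ["surprisal", "sports"],
}
prevalence = feature_prevalence(selections)
shared = sorted(f for f, frac in prevalence.items() if frac == 1.0)
print(sentence_surprisal, prevalence, shared)
```

Features with prevalence near 1 are candidates for shared representations, while features selected for a single participant point to individual-specific tuning. Any cutoff between the two is a modelling choice.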
Topics
- Cognitive Neuroscience
- Language Models
- Sparse Autoencoders
- fMRI Encoding Models
- Neural Interpretability
- Brain-LM Alignment
Best for: AI Scientist, Research Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.