Attention Probes
Summary
Researchers at EleutherAI introduce "attention probes," a method for classifying the internal states of language models that replaces fixed pooling such as mean pooling or last-token aggregation. The probe uses an attention layer with multiple heads and a learned position bias to aggregate hidden states across tokens, akin to cross-attention with a single learned query token. The proposed `attention_probe` function, given as pseudocode in the original article, projects hidden states to attention logits, adds a position bias, applies softmax to obtain attention probabilities, and uses those probabilities to weight the value projections. Experiments on Gemma 2B and Gemma 2 2B models, using datasets like `Anthropic/election_questions` and `LabHC/bias_in_bios`, indicate that 8-head attention probes generally outperform mean probes, especially when the mean probes are trained with AdamW. Single-head attention probes show mixed results. Increasing the head count correlates with higher attention weight entropy.
Key takeaway
For research scientists developing or evaluating language model interpretability tools, attention probes are worth adding to your toolkit. The method is a competitive alternative to mean or last-token pooling, particularly in multi-head configurations, and its attention weights show which tokens the probe relies on for a given prediction. Experiment with the provided `attention-probes` library to assess its performance on your own datasets and tasks, especially for models like Gemma 2B, and inspect the attention patterns for interpretability.
Key insights
Attention probes offer a multi-headed, attention-based alternative to traditional pooling for classifying language model states.
Principles
- Attention probes benefit from multiple heads.
- Training mean and last-token probes with LBFGS rather than AdamW improves their performance.
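As a rough illustration of the baseline being compared against, the sketch below fits a mean-pooled logistic-regression probe with a quasi-Newton optimizer. It is not the article's training code: the function name `fit_mean_probe`, the `l2` penalty, and the binary-label setup are assumptions made for this example, and SciPy's `L-BFGS-B` method stands in for whichever LBFGS implementation the authors used.

```python
import numpy as np
from scipy.optimize import minimize


def fit_mean_probe(hidden, labels, l2=1e-3):
    """Fit a binary logistic-regression probe on mean-pooled hidden states.

    hidden: [batch, seq, d] array of hidden states (hypothetical layout).
    labels: [batch] array of 0/1 labels.
    Returns (weights [d], bias).
    """
    pooled = hidden.mean(axis=1)  # mean pooling over token positions
    n, d = pooled.shape

    def loss_and_grad(params):
        w, b = params[:d], params[d]
        z = pooled @ w + b
        # Numerically stable binary cross-entropy on logits, plus L2 on w.
        loss = np.mean(np.logaddexp(0.0, z) - labels * z) + 0.5 * l2 * (w @ w)
        p = 0.5 * (1.0 + np.tanh(0.5 * z))  # sigmoid(z), stable form
        err = (p - labels) / n
        grad = np.concatenate([pooled.T @ err + l2 * w, [err.sum()]])
        return loss, grad

    # jac=True tells SciPy that the objective returns (loss, gradient).
    result = minimize(loss_and_grad, np.zeros(d + 1), jac=True, method="L-BFGS-B")
    return result.x[:d], result.x[d]
```

A last-token probe would follow the same pattern with `hidden[:, -1, :]` in place of the mean, assuming sequences are not right-padded.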
Method
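Attention probes aggregate hidden states across token positions with an attention layer that has a learned position bias. Each head's value projections are weighted by its attention probabilities and combined to produce the output, replacing fixed pooling such as mean or last-token aggregation.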
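The steps described above can be sketched in NumPy as follows. This is a minimal illustration, not the `attention-probes` library's API: the parameter names, tensor shapes, and the choice of a position bias that is linear in the token index are all assumptions, and padding is ignored for brevity.

```python
import numpy as np


def attention_probe(hidden, query_proj, position_weights, value_proj):
    """Forward pass of a multi-head attention probe (illustrative sketch).

    hidden:           [batch, seq, d]        hidden states
    query_proj:       [d, heads]             maps hidden states to attention logits
    position_weights: [heads]                assumed per-head slope of a linear position bias
    value_proj:       [d, heads, outputs]    per-head value projection
    Returns (output [batch, outputs], probs [batch, seq, heads]).
    """
    seq_len = hidden.shape[1]

    # 1. Project hidden states to one attention logit per token and head.
    logits = np.einsum("bsd,dh->bsh", hidden, query_proj)

    # 2. Add the learned position bias (assumed linear in token index).
    positions = np.arange(seq_len, dtype=hidden.dtype)
    logits = logits + positions[None, :, None] * position_weights[None, None, :]

    # 3. Softmax over the sequence axis gives attention probabilities.
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)

    # 4. Project values per head, weight by attention, sum over tokens and heads.
    values = np.einsum("bsd,dho->bsho", hidden, value_proj)
    output = np.einsum("bsh,bsho->bo", probs, values)
    return output, probs
```

In this sketch, setting `query_proj` and `position_weights` to zero makes every token's attention weight equal, so the probe reduces to a linear probe on mean-pooled hidden states. That makes the mean probe a special case of this formulation.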
In practice
- Use 8-head attention probes for improved performance.
- Consider LBFGS for mean/last-token probe training.
- Analyze attention patterns for interpretability.
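One simple way to quantify an attention pattern is the entropy of each head's attention distribution over tokens, the quantity the summary reports as rising with head count. The helper below is a hypothetical sketch: it assumes the probabilities are laid out as `[batch, seq, heads]` and already sum to 1 along the sequence axis.

```python
import numpy as np


def attention_entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of attention weights over the sequence axis.

    probs: [batch, seq, heads], non-negative and summing to 1 over seq.
    Returns an array of shape [batch, heads].
    """
    # Clip only inside the log so that zero-probability tokens contribute 0.
    safe = np.clip(probs, eps, 1.0)
    return -(probs * np.log(safe)).sum(axis=1)
```

Entropy near 0 means a head attends to a single token. Entropy near `log(seq_len)` means the head spreads its weight almost uniformly and behaves much like mean pooling.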
Topics
- Attention Probes
- Linear Probing
- Language Models
- Model Interpretability
- Attention Mechanisms
Best for: Research Scientist, AI Researcher, AI Scientist, Deep Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published on the EleutherAI Blog.