LLM-Driven Feature Discovery
Summary
LLM-Driven Feature Discovery is a novel method for qualitatively understanding a target model's behaviors by analyzing its transcripts. A dataset of model transcripts is selected, and each transcript is split into user turns, thoughts, and assistant responses. A black-box LLM "autorater" then generates 10-20 "features" for each piece, and the features are semantically embedded and clustered. Finally, another LLM names each cluster from 100 random features drawn from it. Applied to 100k chat transcripts, the method generated 20k features and surfaced interesting Gemini behaviors, such as token awareness and roleplay considerations. The approach resembles "Explaining Datasets in Words" but is simpler and unsupervised. The study also found that logistic regression on user features generally struggled to predict the features of the subsequent thought or response, except where the correlation was clear, as with HTTP status codes. Compared with Sparse Autoencoders, the technique provides clearer explanations and does not require access to model internals.
Key takeaway
AI scientists and machine learning engineers evaluating model behaviors should consider LLM-Driven Feature Discovery as a black-box alternative to interpretability methods such as Sparse Autoencoders (SAEs). It lets you qualitatively characterize model outputs and identify emergent behaviors without internal model access. You can apply it to deployment, training, or evaluation distributions and use what you find to inform model refinement.
Key insights
LLMs can effectively discover and label qualitative behavioral features from model transcripts without internal access.
Principles
- Black-box LLMs can act as effective "autoraters."
- Semantic embedding and clustering reveal behavioral patterns.
- Predicting a model's thought or response features from user-turn features is challenging.
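The prediction experiment behind the last principle can be illustrated with a small sketch. It assumes each user turn has been reduced to binary indicators for feature clusters and that the target is whether a given feature appears in the response. The data is synthetic and invented for illustration, not taken from the study, and the helper names (`fit_logistic`, `predict`) are hypothetical. It shows only the easy case the article mentions, where a user feature such as an HTTP status code maps directly onto a response feature.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(X, y, lr=0.5, epochs=500):
    """Fit logistic regression with batch gradient descent; returns (weights, bias)."""
    n = len(X)
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    """Return True if the predicted probability is at least 0.5."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5

# Synthetic data (an assumption for illustration, not results from the study).
# Column 0: the user turn mentions an HTTP status code.
# Column 1: an unrelated user feature.
X = [[1, 0], [1, 1], [0, 0], [0, 1], [1, 0], [0, 1], [1, 1], [0, 0]]
# Target: the response has an "HTTP status code" feature.
y = [1, 1, 0, 0, 1, 0, 1, 0]

w, b = fit_logistic(X, y)
accuracy = sum(predict(w, b, xi) == bool(yi) for xi, yi in zip(X, y)) / len(X)
print(f"weights={w}, bias={b:.2f}, training accuracy={accuracy:.2f}")
```

Here the target is fully determined by one user feature, so the classifier separates the data easily. According to the summary, most thought and response features had no such direct counterpart in the user turn, which is where prediction struggled.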
Method
Transcripts are split into user turns, thoughts, and assistant responses. An LLM generates 10-20 features per piece, the features are embedded and clustered, and another LLM names each cluster from 100 randomly sampled features.
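The pipeline can be sketched end to end in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation. The summary does not say which embedding model or clustering algorithm was used, so a hashed bag-of-words vector and plain k-means stand in for them. The two LLM calls are replaced by hypothetical stub functions (`fake_autorater`, `fake_namer`), and the transcripts are assumed to be already split into pieces.

```python
import math
import random
import zlib
from collections import Counter, defaultdict

def embed(text, dim=32):
    """Toy hashed bag-of-words vector; a stand-in for a real semantic embedding model."""
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def kmeans(vectors, k, iterations=20, seed=0):
    """Plain k-means (an assumed clustering choice); returns one label per vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each vector to its nearest centroid (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            for v in vectors
        ]
        # Move each centroid to the mean of its members; leave empty clusters alone.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def discover_features(pieces, generate_features, name_cluster, k,
                      sample_size=100, seed=0):
    """Generate features per piece, cluster them, and name each cluster from a sample."""
    features = [f for piece in pieces for f in generate_features(piece)]
    labels = kmeans([embed(f) for f in features], k, seed=seed)
    clusters = defaultdict(list)
    for feature, label in zip(features, labels):
        clusters[label].append(feature)
    rng = random.Random(seed)
    named = []
    for members in clusters.values():
        # The naming model sees at most `sample_size` random features per cluster.
        sample = rng.sample(members, min(sample_size, len(members)))
        named.append((name_cluster(sample), members))
    return named

def fake_autorater(piece):
    """Hypothetical stub; a real autorater would prompt an LLM for 10-20 features."""
    return [f"mentions {word}" for word in sorted(set(piece.lower().split()))]

def fake_namer(sample):
    """Hypothetical stub; a real namer would prompt an LLM with the sampled features."""
    return Counter(sample).most_common(1)[0][0]

pieces = [
    "why did my request return 404",          # user turn
    "check the token budget before answering",  # thought
    "the server returned 404 not found",      # assistant response
]
for name, members in discover_features(pieces, fake_autorater, fake_namer, k=3):
    print(name, len(members))
```

In a real run, `generate_features` and `name_cluster` would wrap calls to whichever LLM is available, and `embed` and `kmeans` would be swapped for a proper embedding model and a clustering library.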
In practice
- Use LLMs to characterize model outputs qualitatively.
- Apply clustering to identify common behavioral themes.
- Explore LLM-driven feature discovery for model evaluation.
Topics
- LLM Interpretability
- Feature Discovery
- Model Behavior Analysis
- Black-Box Models
- Sparse Autoencoders
- Gemini
Best for: AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by AI Alignment Forum.