LLM-Driven Feature Discovery

2026-06-23 · Source: AI Alignment Forum · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, medium

Summary

LLM-Driven Feature Discovery is a method for qualitatively understanding a target model's behavior by analyzing its transcripts. A dataset of model transcripts is selected, and each transcript is split into user turns, thoughts, and assistant responses. A black-box LLM "autorater" generates 10-20 "features" for each piece; the features are semantically embedded and clustered, and a second LLM names each cluster from 100 randomly sampled features. Applied to 100k chat transcripts, the method generated 20k features and surfaced Gemini behaviors such as token awareness and roleplay considerations. The approach resembles "Explaining Datasets in Words" but is simpler and unsupervised. Logistic regression on user features generally struggled to predict the features of the subsequent thought or response, except where the correlation was obvious, as with HTTP status codes. Compared with Sparse Autoencoders (SAEs), the technique yields clearer explanations and does not require access to model internals.

Key takeaway

If you are an AI scientist or machine learning engineer evaluating model behavior, consider LLM-Driven Feature Discovery as a black-box alternative to interpretability methods such as Sparse Autoencoders (SAEs). It lets you characterize model outputs qualitatively and identify emergent behaviors without access to model internals. You can apply it to deployment, training, or evaluation distributions and use the results to inform model refinement.

Key insights

LLMs can effectively discover and label qualitative behavioral features from model transcripts without internal access.

Principles

Black-box LLMs can act as effective "autoraters."
Semantic embedding and clustering reveal behavioral patterns.
Predicting a model's thought or response features from user-turn features is difficult, except for obvious correlations.
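
The prediction experiment behind the last principle can be illustrated with a small logistic regression. This is a sketch on synthetic data, not the study's setup: the binary encoding (one column per user-feature cluster, one label per response-feature cluster), the function names, and the toy rows are all assumptions made for illustration. The toy rows encode only the easy case the summary mentions, where a user question about HTTP status codes is followed by a response that explains one.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=1000):
    """Fit logistic regression by batch gradient descent on mean log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    """Return True if the response feature is predicted to be present."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5

# Synthetic rows (assumption). Columns are user-feature indicators:
# [asks about HTTP status codes, asks for a poem]
X_toy = [[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [0, 0]]
# Label: does the response carry the feature "explains an HTTP status code"?
y_toy = [1, 1, 1, 0, 0, 0]

w_toy, b_toy = fit_logistic(X_toy, y_toy)
```

On real transcripts, each response or thought feature cluster would get its own classifier of this kind. According to the summary, most of them predicted poorly.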

Method

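Each transcript is split into user turns, thoughts, and assistant responses. An LLM autorater generates 10-20 features per piece, the features are semantically embedded and clustered, and a second LLM names each cluster from 100 randomly sampled features.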
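
The steps above can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation. The transcript format (a list of dicts with "role" and "text"), the function names, and the stub autorater and namer are hypothetical. A bag-of-words vector stands in for a semantic embedding model, and a small k-means with farthest-point initialization stands in for whatever clustering the original work used.

```python
import math
import random
from collections import Counter, defaultdict

def split_transcript(transcript):
    """Split a transcript into (role, text) pieces: user, thought, response."""
    return [(turn["role"], turn["text"]) for turn in transcript]

def embed(text, vocab):
    """Toy stand-in for a semantic embedding: L2-normalized word counts."""
    counts = Counter(text.lower().split())
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def kmeans(vectors, k, iters=10):
    """Small k-means with deterministic farthest-point initialization."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    centers = [list(vectors[0])]
    while len(centers) < k:
        far = max(vectors, key=lambda v: min(dist(v, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist(v, centers[c])) for v in vectors]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def discover_features(transcripts, autorater, namer, k, sample_size=100, seed=0):
    """Run split -> feature generation -> embed -> cluster -> name."""
    features = []
    for transcript in transcripts:
        for role, text in split_transcript(transcript):
            features.extend(autorater(role, text))  # real autorater: 10-20 per piece
    vocab = sorted({w for f in features for w in f.lower().split()})
    labels = kmeans([embed(f, vocab) for f in features], k)
    clusters = defaultdict(list)
    for feature, label in zip(features, labels):
        clusters[label].append(feature)
    rng = random.Random(seed)
    named = []
    for members in clusters.values():
        # Name each cluster from up to `sample_size` randomly drawn features.
        sample = rng.sample(members, min(sample_size, len(members)))
        named.append((namer(sample), members))
    return named

def stub_autorater(role, text):
    """Hypothetical stand-in for the LLM autorater (returns canned features)."""
    if "token" in text.lower():
        return ["counts tokens left", "tracks tokens budget"]
    return ["stays in roleplay", "keeps roleplay voice"]

def stub_namer(sample):
    """Hypothetical stand-in for the naming LLM: most frequent word in the sample."""
    words = Counter(w for feature in sample for w in feature.split())
    return words.most_common(1)[0][0]
```

In a real run, `stub_autorater` and `stub_namer` would be replaced by calls to an LLM, and `embed` by a semantic embedding model. The sketch assumes `k` does not exceed the number of generated features.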

In practice

Use LLMs to characterize model outputs qualitatively.
Apply clustering to identify common behavioral themes.
Explore LLM-driven feature discovery for model evaluation.

Topics

LLM Interpretability
Feature Discovery
Model Behavior Analysis
Black-Box Models
Sparse Autoencoders
Gemini

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by AI Alignment Forum.