Don't Fight Backprop: Goodfire's Vision for Intentional Design, w/ Dan Balsam & Tom McGrath
Summary
Goodfire, a mechanistic interpretability startup, recently announced a $150 million Series B round valuing the company at $1.25 billion. Co-founders Dan Balsam and Tom McGrath discussed their new research pillar, "Intentional Design," which aims to move beyond reverse-engineering models to actively shaping how they learn. The approach relies on understanding geometric structures in latent spaces rather than only sparse autoencoder features. They presented a proof of concept for reducing hallucinations using a probe and reinforcement learning, noting that running the probe on a frozen copy of the model helps prevent reward hacking. Goodfire also shared results on Alzheimer's prediction, where their model identified cell-free DNA fragment length as a key biomarker. They also described a method for disentangling memorization weights from reasoning weights; removing the memorization-associated weights improved performance on some reasoning tasks.
Key takeaway
For research scientists and AI developers working on model alignment and control, Intentional Design techniques are worth exploring. The hallucination reduction work suggests prioritizing methods that shape the loss landscape over methods that directly fight gradient descent. The approach may lead to more robust and controllable models, with potential compute savings from improved sample efficiency, and may support scientific discovery by extracting novel insights from model behaviors.
Key insights
Intentional Design aims to proactively shape AI model learning by understanding latent space geometry and intervening in training.
Principles
- Avoid fighting backpropagation; instead, shape the loss landscape.
- Interpretability serves as an observation system for controllable training.
- Model capabilities can be improved by removing memorization-specific weights.
Method
The hallucination reduction technique runs a probe on a frozen copy of the model to detect hallucinations. The probe's output provides a reward signal for RL training and also enables runtime token injection to correct errors.
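To make the probe-as-reward idea concrete, here is a minimal sketch, not Goodfire's implementation. It assumes the probe is a simple logistic-regression readout over per-token activations, and that those activations come from a frozen copy of the model rather than from the policy being trained. The function names, the toy weights, and the reward formula (one minus the mean hallucination probability) are all illustrative assumptions.

```python
import math
from typing import List, Sequence


def probe_hallucination_prob(
    activation: Sequence[float], weights: Sequence[float], bias: float
) -> float:
    """Hypothetical linear probe: sigmoid of a dot product over one token's activation."""
    logit = sum(a * w for a, w in zip(activation, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def reward_from_probe(
    frozen_activations: List[Sequence[float]],
    weights: Sequence[float],
    bias: float,
) -> float:
    """Assumed reward shape: 1 minus the mean per-token hallucination probability.

    The activations are expected to come from a frozen model copy, so updates to
    the trained policy cannot shift the representation the probe reads.
    """
    probs = [probe_hallucination_prob(a, weights, bias) for a in frozen_activations]
    return 1.0 - sum(probs) / len(probs)


# Toy probe parameters and toy activations (illustrative values only).
PROBE_WEIGHTS = [2.0, -1.0]
PROBE_BIAS = 0.0

grounded_activations = [[-1.0, 1.0], [-2.0, 0.0]]
hallucinated_activations = [[1.0, -1.0], [2.0, 0.0]]

grounded_reward = reward_from_probe(grounded_activations, PROBE_WEIGHTS, PROBE_BIAS)
hallucinated_reward = reward_from_probe(
    hallucinated_activations, PROBE_WEIGHTS, PROBE_BIAS
)
```

In this toy setup the grounded sample receives the higher reward. The design point carried over from the discussion is that the probe reads a frozen copy: if it read the live policy's activations, training could learn to move those activations instead of reducing hallucinations.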
In practice
- Utilize interpretability for biomarker discovery in life sciences.
- Employ gradient decomposition to guide model learning away from undesirable behaviors.
- Consider freezing specific circuits for surgical model updates.
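The last two items above can be illustrated with a small sketch. It shows one possible reading of "gradient decomposition" (splitting a gradient into the part along a known undesirable direction and the remainder, then discarding the former) and one simple way to freeze selected parameters (zeroing their gradient entries). Both functions, the direction vector, and the mask are hypothetical; the source does not specify how Goodfire implements either idea.

```python
from typing import List, Sequence, Tuple


def decompose_gradient(
    grad: Sequence[float], direction: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Split grad into components parallel and orthogonal to `direction`."""
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0.0:
        # No direction to project onto: everything counts as orthogonal.
        return [0.0] * len(grad), list(grad)
    coeff = sum(g * d for g, d in zip(grad, direction)) / norm_sq
    parallel = [coeff * d for d in direction]
    orthogonal = [g - p for g, p in zip(grad, parallel)]
    return parallel, orthogonal


def apply_freeze_mask(grad: Sequence[float], frozen: Sequence[bool]) -> List[float]:
    """Zero the gradient for parameters marked as frozen (e.g. a protected circuit)."""
    return [0.0 if is_frozen else g for g, is_frozen in zip(grad, frozen)]


# Toy example: a 2-parameter gradient and an assumed "undesirable" direction.
grad = [3.0, 4.0]
undesirable_direction = [1.0, 0.0]

parallel, orthogonal = decompose_gradient(grad, undesirable_direction)
# Updating with `orthogonal` only leaves the model unchanged along that direction.

# Freezing the second parameter leaves only the first free to update.
masked = apply_freeze_mask(grad, [False, True])
```

In a real training loop the direction would have to come from interpretability analysis, and the mask would cover whole weight groups rather than single scalars; both are left abstract here.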
Topics
- Mechanistic Interpretability
- Intentional Design
- AI Alignment
- Hallucination Reduction
- Alzheimer's Prediction
Best for: Investor, Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by The Cognitive Revolution.