Don't Fight Backprop: Goodfire's Vision for Intentional Design, w/ Dan Balsam & Tom McGrath
Summary
Goodfire, a mechanistic interpretability startup, recently announced a $150M Series B at a $1.25B valuation and introduced "Intentional Design" as a new research pillar. The initiative aims to expand interpretability science beyond reverse engineering: instead of only analyzing trained models, it shapes the loss landscape during training to control what a model learns and how it generalizes. CTO Dan Balsam and Chief Scientist Tom McGrath discussed developments in interpretability, including the move from sparse autoencoders toward understanding geometric structures in latent spaces. Their first proof of concept for Intentional Design is a hallucination-reduction technique that uses a probe both to detect hallucinations and to supply a reward signal for RL training, with the probe running on a frozen copy of the model. Goodfire also published work on Alzheimer's prediction based on cell-free DNA, and a project demonstrating that memorization weights can be separated and removed to improve reasoning performance.
Key takeaway
For research scientists working on AI alignment and model control, Goodfire's Intentional Design paradigm suggests a shift from purely diagnostic interpretability to proactive intervention. You should explore methods that shape the loss landscape to guide model learning, rather than attempting to counteract back-propagation directly. Consider how techniques like using frozen model copies for reward signals can mitigate risks like reward hacking in your own alignment research.
Key insights
Intentional Design aims to shape what AI models learn by influencing the loss landscape during training, rather than only reverse-engineering models after the fact.
Principles
- Avoid fighting back-propagation.
- Shape the loss landscape for desired learning.
Method
The hallucination-reduction technique uses a probe to detect hallucinations. The probe serves two roles: it steers the model at runtime, and it provides a reward signal for RL training. In both cases the probe runs on a frozen copy of the model.
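The reward-signal half of this method can be illustrated with a toy sketch. This is not Goodfire's implementation: `FrozenEncoder`, `LinearProbe`, and `probe_reward` are hypothetical names, the "model" is a random embedding table, and the probe weights are made up. The only point it shows is the wiring, in which the reward is computed from the activations of a frozen copy, so training updates to the policy never change what the probe reads.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


class FrozenEncoder:
    """Hypothetical stand-in for a frozen copy of the model.

    Maps token ids to a hidden vector. Weights are stored as tuples so
    they cannot be modified after construction.
    """

    def __init__(self, vocab_size, hidden_dim, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        self._embed = tuple(
            tuple(rng.gauss(0.0, 1.0) for _ in range(hidden_dim))
            for _ in range(vocab_size)
        )

    def activations(self, token_ids):
        """Mean-pool the embeddings of the response tokens."""
        rows = [self._embed[t] for t in token_ids]
        return [sum(col) / len(rows) for col in zip(*rows)]


class LinearProbe:
    """Hypothetical logistic probe: activation vector -> P(hallucination)."""

    def __init__(self, weights, bias=0.0):
        self.weights = list(weights)
        self.bias = bias

    def hallucination_prob(self, hidden):
        logit = sum(w * h for w, h in zip(self.weights, hidden)) + self.bias
        return sigmoid(logit)


def probe_reward(frozen, probe, response_token_ids):
    """Reward in [0, 1]: higher when the probe sees less hallucination.

    The probe reads the FROZEN copy's activations, not those of the
    policy being trained.
    """
    hidden = frozen.activations(response_token_ids)
    return 1.0 - probe.hallucination_prob(hidden)


if __name__ == "__main__":
    frozen = FrozenEncoder(vocab_size=50, hidden_dim=8)
    probe = LinearProbe([0.5] * 8, bias=-0.1)  # made-up probe weights
    print(probe_reward(frozen, probe, [3, 17, 42]))
```

In a real RL loop this scalar would feed a policy-gradient update on the trainable model, which is omitted here. The summary connects the frozen-copy choice to mitigating reward hacking; the sketch only mirrors that structure and does not demonstrate the effect.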
In practice
- Separate and remove memorization weights to improve reasoning performance.
- Use frozen model copies for probe-based training.
Topics
- Mechanistic Interpretability
- Intentional Design
- Hallucination Reduction
- Loss Landscape Optimization
- Model Generalization
Best for: Research Scientist, Investor, AI Researcher, AI Engineer, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by The Cognitive Revolution.