Don't Fight Backprop: Goodfire's Vision for Intentional Design, w/ Dan Balsam & Tom McGrath

2026-03-05 · Source: The Cognitive Revolution · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, extended

Summary

Goodfire, a mechanistic interpretability startup, recently announced a $150 million Series B round that values the company at $1.25 billion. Co-founders Dan Balsam and Tom McGrath discussed their new research pillar, "Intentional Design," which aims to move beyond reverse-engineering trained models to actively shaping how they learn. The approach relies on understanding geometric structures in latent spaces rather than only sparse autoencoder features. They presented a proof of concept that reduces hallucinations by combining a probe with reinforcement learning, and noted that running the probe on a frozen copy of the model helps prevent reward hacking. Goodfire also shared two further results. In Alzheimer's prediction, their model identified cell-free DNA fragment length as a key biomarker. In a separate study, a method for disentangling memorization weights from reasoning weights improved performance on some reasoning tasks once the memorization-associated weights were removed.

Key takeaway

Research scientists and AI developers working on model alignment and control should explore Intentional Design techniques. Prioritize methods that shape the loss landscape rather than fighting gradient descent directly, as the hallucination reduction work demonstrates. This approach can lead to more robust and controllable models, may save significant compute through improved sample efficiency, and can support scientific discovery by extracting novel insights from model behavior.

Key insights

Intentional Design aims to proactively shape AI model learning by understanding latent space geometry and intervening in training.

Principles

Avoid fighting back-propagation; instead, shape the loss landscape.
Interpretability serves as an observation system for controllable training.
Model capabilities can be improved by removing memorization-specific weights.

Method

A hallucination reduction technique uses a probe on a frozen model copy to detect hallucinations, providing a reward signal for RL training and enabling runtime token injection to correct errors.
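The reward side of this method can be illustrated with a minimal sketch. It assumes a logistic (linear) probe and a reward equal to one minus the mean probe probability; neither detail is confirmed by the source. All names here (`probe_score`, `hallucination_reward`, the weights and bias) are hypothetical. The point it shows is that the activations scored by the probe are read from the frozen copy, so updates to the policy being trained cannot shift the representation the probe depends on.

```python
import math
from typing import List, Sequence


def probe_score(activation: Sequence[float],
                weights: Sequence[float],
                bias: float) -> float:
    """Logistic probe: estimated probability that this token is hallucinated.

    Assumption: a linear probe with a sigmoid output. The source does not
    specify the probe architecture.
    """
    logit = sum(a * w for a, w in zip(activation, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def hallucination_reward(frozen_activations: List[Sequence[float]],
                         weights: Sequence[float],
                         bias: float) -> float:
    """Reward in [0, 1] for one sampled response.

    `frozen_activations` holds one activation vector per generated token,
    obtained by running the policy's output through a FROZEN copy of the
    model (hypothetical setup). Higher reward means the probe saw less
    evidence of hallucination.
    """
    if not frozen_activations:
        raise ValueError("need at least one token activation")
    scores = [probe_score(a, weights, bias) for a in frozen_activations]
    return 1.0 - sum(scores) / len(scores)


# Illustrative values only, not real model activations.
probe_w = [1.0, -1.0]
clean_tokens = [[-2.0, 2.0], [-1.0, 1.0]]
suspect_tokens = [[2.0, -2.0], [1.0, -1.0]]
reward_clean = hallucination_reward(clean_tokens, probe_w, 0.0)
reward_suspect = hallucination_reward(suspect_tokens, probe_w, 0.0)
```

In an RL loop, a scalar like `reward_clean` would be passed to whatever policy-gradient update is in use. The runtime token injection mentioned above is not covered by this sketch.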

In practice

Utilize interpretability for biomarker discovery in life sciences.
Employ gradient decomposition to guide model learning away from undesirable behaviors.
Consider freezing specific circuits for surgical model updates.
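The last two items can be sketched in a few lines, under stated assumptions. Here "gradient decomposition" is read as splitting a gradient into the part parallel to a direction associated with an undesired behavior and the part orthogonal to it, then stepping only along the orthogonal part. "Freezing a circuit" is read as skipping updates for a chosen set of parameter indices. Both readings, and every name below (`decompose_gradient`, `masked_step`, `frozen`), are illustrative assumptions rather than Goodfire's published method.

```python
from typing import List, Sequence, Set, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def decompose_gradient(grad: Sequence[float],
                       direction: Sequence[float]
                       ) -> Tuple[List[float], List[float]]:
    """Split `grad` into (parallel, orthogonal) parts relative to `direction`.

    `direction` is a hypothetical vector in parameter space that an
    interpretability tool has associated with an undesired behavior.
    """
    norm_sq = dot(direction, direction)
    if norm_sq == 0.0:
        raise ValueError("direction must be non-zero")
    scale = dot(grad, direction) / norm_sq
    parallel = [scale * d for d in direction]
    orthogonal = [g - p for g, p in zip(grad, parallel)]
    return parallel, orthogonal


def masked_step(params: Sequence[float],
                grad: Sequence[float],
                frozen: Set[int],
                lr: float) -> List[float]:
    """One gradient-descent step that leaves the `frozen` indices untouched."""
    return [p if i in frozen else p - lr * g
            for i, (p, g) in enumerate(zip(params, grad))]


# Illustrative values: drop the gradient component along the undesired
# direction, then update only the parameters outside the frozen set.
parallel, orthogonal = decompose_gradient([3.0, 4.0], [1.0, 0.0])
new_params = masked_step([1.0, 2.0], orthogonal, frozen={0}, lr=0.5)
```

Stepping along `orthogonal` changes the effective update direction instead of adding a penalty that competes with the main objective, which is one way to read the "don't fight backprop" principle. Real systems would apply the same idea to tensors in a deep-learning framework instead of Python lists.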

Topics

Mechanistic Interpretability
Intentional Design
AI Alignment
Hallucination Reduction
Alzheimer's Prediction

Best for: Investor, Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by The Cognitive Revolution.