Don't Fight Backprop: Goodfire's Vision for Intentional Design, w/ Dan Balsam & Tom McGrath

2026-03-05 · Source: The Cognitive Revolution · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, short

Summary

Goodfire, a mechanistic interpretability startup, recently announced a $150M Series B round at a $1.25B valuation and introduced "Intentional Design" as a new research pillar. The initiative aims to extend interpretability science beyond reverse engineering: instead of only analyzing trained models, it shapes the loss landscape during training to control what a model learns and how it generalizes. CTO Dan Balsam and Chief Scientist Tom McGrath discussed how interpretability research has moved from sparse autoencoders toward understanding geometric structures in latent spaces. Their first proof of concept for Intentional Design is a hallucination-reduction technique in which a probe both detects hallucinations and supplies the reward signal for RL training; notably, the probe runs on a frozen copy of the model. Goodfire also published work on Alzheimer's prediction from cell-free DNA, and a project showing that memorization weights can be separated and removed to improve reasoning performance.

Key takeaway

For research scientists working on AI alignment and model control, Goodfire's Intentional Design paradigm suggests a shift from purely diagnostic interpretability to proactive intervention. You should explore methods that shape the loss landscape to guide model learning, rather than attempting to counteract back-propagation directly. Consider how techniques like using frozen model copies for reward signals can mitigate risks like reward hacking in your own alignment research.

Key insights

Intentional Design aims to proactively shape what AI models learn by influencing the loss landscape, rather than solely reverse-engineering models after training.

Principles

Avoid fighting back-propagation.
Shape the loss landscape for desired learning.

Method

A hallucination-reduction technique uses a probe to detect hallucinations. The probe's output is used both to steer the model at runtime and as a reward signal for RL training, and the probe runs on a frozen copy of the model.
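
As a rough illustration of the method above, the following Python sketch shows one way a probe-derived reward might be wired up. Everything in it is an assumption made for illustration: `FrozenModel`, `LinearProbe`, `probe_reward`, the toy activation feature, and the probe weights are hypothetical and are not taken from Goodfire's work. The idea it tries to capture is that the probe scores activations from a frozen copy, so updates to the policy being trained do not change the representation the probe reads.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class FrozenModel:
    """Toy stand-in for a frozen model copy (hypothetical).

    A real setup would run a forward pass through a checkpoint whose
    weights are never updated during RL. Here the "activations" are a
    single hand-made feature so the example runs with the stdlib only.
    """
    fabricated_prefix: str = "fake_"

    def activations(self, tokens: Sequence[str]) -> List[float]:
        n = max(len(tokens), 1)
        # Fraction of tokens flagged as fabricated in this toy world.
        fabricated = sum(t.startswith(self.fabricated_prefix) for t in tokens) / n
        return [fabricated]


@dataclass(frozen=True)
class LinearProbe:
    """Hypothetical linear probe: sigmoid(w . h + b) read as P(hallucination)."""
    weights: Sequence[float]
    bias: float

    def score(self, activations: Sequence[float]) -> float:
        logit = sum(w * h for w, h in zip(self.weights, activations)) + self.bias
        return 1.0 / (1.0 + math.exp(-logit))


def probe_reward(probe: LinearProbe, frozen: FrozenModel, tokens: Sequence[str]) -> float:
    """Reward in [0, 1]: higher when the probe sees less evidence of hallucination.

    The sampled output comes from the policy being trained, but it is
    scored through the frozen copy, never through the policy itself.
    """
    return 1.0 - probe.score(frozen.activations(tokens))


# Assumed probe parameters; a real probe would be fit on labeled activations.
probe = LinearProbe(weights=(6.0,), bias=-3.0)
frozen = FrozenModel()

grounded = ["paris", "is", "the", "capital", "of", "france"]
fabricated = ["fake_city", "is", "the", "capital", "of", "fake_country"]

reward_grounded = probe_reward(probe, frozen, grounded)
reward_fabricated = probe_reward(probe, frozen, fabricated)
print(f"grounded={reward_grounded:.3f} fabricated={reward_fabricated:.3f}")
```

In an RL loop, a scalar like `reward_grounded` would be combined with the task reward for each sampled response. Both classes are declared immutable here to mirror the point that neither the frozen copy nor the probe is updated while the policy trains.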

In practice

Separate memorization weights to improve reasoning.
Use frozen model copies for probe-based training.

Topics

Mechanistic Interpretability
Intentional Design
Hallucination Reduction
Loss Landscape Optimization
Model Generalization

Best for: Research Scientist, Investor, AI Researcher, AI Engineer, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by The Cognitive Revolution.