Sparse probes and murky physics: a case study of interpretability challenges in a foundation model for continuum dynamics
Summary
Polymathic's Walrus, a foundation model for continuum dynamics, presents significant interpretability challenges despite its ability to reproduce known behaviors. Researchers investigated its internal mechanisms using mechanistic interpretability: they applied a sparse autoencoder (SAE) to a selected layer and triaged over 20,000 features using enstrophy as a physical metric. Focusing on shear flow, the study found piecewise consistency in feature recruitment across different setups, with subsets of features recurring in similar roles. However, this internal structure proved intermittent and did not align cleanly with standard physical decompositions. Direct comparisons revealed systematic output-level discrepancies, including regimes where energy or structures became either too diffuse or too localized, and these discrepancies were linked to specific changes in SAE feature usage. The work highlights two open questions: how to prioritize meaningful features, and how to distinguish stable structures from analysis artifacts.
Key takeaway
AI scientists developing or evaluating scientific foundation models should anticipate that internal representations may not directly mirror established physical theories. When output discrepancies arise, investigate changes in which SAE features are used, since in this study such changes were linked to output-level errors such as overly diffuse or overly localized energy. Prioritize robust methods for distinguishing genuinely informative internal structures from analysis artifacts, and design benchmarks that assess both output accuracy and mechanistic consistency.
Key insights
Interpreting a scientific foundation model reveals intermittent internal structure that does not align cleanly with standard physical decompositions, even though the model reproduces known behaviors.
Principles
- Internal model consistency can be piecewise.
- Output discrepancies link to feature usage.
- Physical metrics aid feature triage.
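For reference, enstrophy is conventionally defined from the vorticity of the velocity field. The summary does not state the exact form or normalization used in the study, so the expression below is the standard textbook definition, with the domain symbol D and the 2D velocity components (u, v) chosen here for illustration.

```latex
\boldsymbol{\omega} = \nabla \times \mathbf{u},
\qquad
\mathcal{E} = \frac{1}{2} \int_{D} \lvert \boldsymbol{\omega} \rvert^{2} \, d\mathbf{x},
\qquad
\text{and in 2D, with } \mathbf{u} = (u, v): \quad
\omega = \frac{\partial v}{\partial x} - \frac{\partial u}{\partial y}.
```

Because enstrophy grows where the flow is strongly rotational, it is a natural scalar for flagging features that respond to vortical structure, as in shear flow.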
Method
Apply a sparse autoencoder (SAE) to a model layer. Triage features using a physically grounded metric like enstrophy. Compare feature recruitment across varied simulation setups.
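The three steps can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the study's implementation. The ReLU encoder form, the use of Pearson correlation as the triage score, and every function and parameter name below are hypothetical choices made here. The summary does not specify the SAE architecture or how enstrophy was paired with feature activations.

```python
import numpy as np

def enstrophy(u, v, dx=1.0, dy=1.0):
    """Mean enstrophy 0.5 * <omega^2> of a 2D velocity field on a uniform grid.

    u, v have shape (ny, nx); vorticity is omega = dv/dx - du/dy.
    """
    dv_dx = np.gradient(v, dx, axis=1)  # x varies along axis 1
    du_dy = np.gradient(u, dy, axis=0)  # y varies along axis 0
    omega = dv_dx - du_dy
    return 0.5 * np.mean(omega ** 2)

def sae_encode(acts, W_enc, b_enc):
    """Assumed ReLU SAE encoder: (n_samples, d_model) -> (n_samples, n_features)."""
    return np.maximum(acts @ W_enc + b_enc, 0.0)

def triage_by_enstrophy(feature_acts, enstrophies, top_k=10):
    """Rank features by |Pearson correlation| between activation and enstrophy.

    feature_acts: (n_samples, n_features); enstrophies: (n_samples,).
    Returns (feature indices, their correlations), strongest first.
    """
    f = feature_acts - feature_acts.mean(axis=0)
    e = enstrophies - enstrophies.mean()
    denom = np.linalg.norm(f, axis=0) * np.linalg.norm(e)
    corr = np.zeros(f.shape[1])
    # Features (or enstrophy series) with zero variance keep a correlation of 0.
    np.divide(f.T @ e, denom, out=corr, where=denom > 0)
    order = np.argsort(-np.abs(corr))[:top_k]
    return order, corr[order]
```

Correlation is only one possible triage score. With tens of thousands of features, any ranking of this kind narrows the candidates for closer inspection but does not by itself establish that a feature is physically meaningful.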
In practice
- Use SAEs for scientific model interpretability.
- Employ physical metrics for feature prioritization.
- Analyze feature recruitment across parameter ranges.
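One simple way to quantify feature recruitment across parameter ranges is to compare the sets of active SAE features between setups. The sketch below uses Jaccard overlap. The activity criterion (firing threshold and minimum firing fraction), the setup names, and all function names are assumptions made for illustration, not details taken from the study.

```python
import numpy as np

def active_features(feature_acts, threshold=0.0, min_frac=0.05):
    """Indices of features above `threshold` on at least `min_frac` of samples."""
    firing_rate = (feature_acts > threshold).mean(axis=0)
    return set(np.flatnonzero(firing_rate >= min_frac).tolist())

def jaccard(a, b):
    """Jaccard overlap |a & b| / |a | b| of two index sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def recruitment_overlap(acts_by_setup, **kwargs):
    """Pairwise Jaccard overlap of active-feature sets across named setups.

    acts_by_setup maps a setup name to an (n_samples, n_features) activation array.
    """
    sets = {name: active_features(a, **kwargs) for name, a in acts_by_setup.items()}
    names = sorted(sets)
    return {
        (p, q): jaccard(sets[p], sets[q])
        for i, p in enumerate(names)
        for q in names[i + 1:]
    }
```

High overlap between neighboring setups with a drop at particular parameter values would be one signature of the piecewise consistency described above, though the result depends on the chosen thresholds and should be checked for sensitivity to them.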
Topics
- Foundation Models
- Continuum Dynamics
- Mechanistic Interpretability
- Sparse Autoencoders
- Scientific Machine Learning
Best for: AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.