Discovering Interpretable Algorithms by Decompiling Transformers to RASP
Summary
A new method extracts interpretable RASP programs from trained Transformers, addressing the open question of whether these models internally implement simple, understandable algorithms. The paper (arXiv:2602.08857, accepted for ICML 2026) builds on prior findings that Transformer computations can be simulated in the RASP family of programming languages and that length generalization is linked to simple RASP programs. The proposed technique faithfully re-parameterizes a Transformer as a RASP program, then applies causal interventions to discover a minimal sufficient sub-program. In experiments on small Transformers trained on algorithmic and formal language tasks, the method frequently recovered simple, interpretable RASP programs from models that length-generalize, which the authors present as direct evidence that such models implement simple RASP programs internally. The paper is 104 pages long and includes 92 figures.
Key takeaway
For AI scientists and machine learning engineers working on model interpretability, this research offers a way to test whether a length-generalizing Transformer implements a simple, interpretable algorithm. Consider applying the RASP-based decompilation and causal intervention technique to inspect a model's internal mechanisms, which may help with debugging and with building trust in its behavior. The reported evidence comes from small Transformers trained on algorithmic and formal language tasks.
Key insights
A method decompiles Transformers into RASP programs, revealing their internal interpretable algorithms.
Principles
- Transformers' computations can be simulated in RASP.
- Length generalization has been linked to simple RASP programs.
- Causal interventions discover minimal sufficient sub-programs.
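As a rough illustration of what a RASP-style program looks like, the sketch below implements toy versions of the select, aggregate, and selector_width primitives from the RASP literature in plain Python. It is not the paper's implementation or any RASP library's API; the function names and the two example programs (histogram, prefix_fraction) are assumptions chosen for illustration.

```python
from typing import Callable, List, Sequence

# selector[q][k] is True when query position q attends to key position k.
Selector = List[List[bool]]


def select(keys: Sequence, queries: Sequence,
           predicate: Callable[[object, object], bool]) -> Selector:
    """Build an attention pattern by testing every (key, query) pair."""
    return [[predicate(k, q) for k in keys] for q in queries]


def selector_width(selector: Selector) -> List[int]:
    """Count how many key positions each query position attends to."""
    return [sum(row) for row in selector]


def aggregate(selector: Selector, values: Sequence[float],
              default: float = 0.0) -> List[float]:
    """Average the selected values for each query position."""
    out = []
    for row in selector:
        picked = [v for v, on in zip(values, row) if on]
        out.append(sum(picked) / len(picked) if picked else default)
    return out


def histogram(tokens: Sequence) -> List[int]:
    """For each position, count how often its token occurs in the input."""
    same_token = select(tokens, tokens, lambda k, q: k == q)
    return selector_width(same_token)


def prefix_fraction(tokens: Sequence, target) -> List[float]:
    """For each position, the fraction of tokens so far equal to target."""
    indices = list(range(len(tokens)))
    # Causal pattern: each position attends to itself and earlier positions.
    causal = select(indices, indices, lambda k, q: k <= q)
    indicator = [1.0 if t == target else 0.0 for t in tokens]
    return aggregate(causal, indicator)


print(histogram("hello"))
print(prefix_fraction("abaa", "a"))
```

Neither program refers to a fixed sequence length, so the same code runs unchanged on longer inputs. That is one intuition for why simple RASP programs are associated with length generalization.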
Method
The method re-parameterizes a Transformer as a RASP program, then applies causal interventions to discover a small, sufficient sub-program, often recovering simple and interpretable algorithms.
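The pruning step can be sketched as a greedy ablation loop. The summary does not specify the paper's intervention procedure, so everything below is an assumption: components are modeled as hypothetical named functions whose outputs are summed, ablation means dropping a component (zero ablation), and a sub-program counts as sufficient if it matches the full program on a set of probe inputs.

```python
from typing import Callable, Dict, FrozenSet, Sequence

Component = Callable[[str], float]


def run(components: Dict[str, Component], active: FrozenSet[str], x: str) -> float:
    """Evaluate the toy program using only the active components."""
    return sum(fn(x) for name, fn in components.items() if name in active)


def is_sufficient(components: Dict[str, Component], active: FrozenSet[str],
                  probes: Sequence[str], tol: float = 1e-9) -> bool:
    """Check that the sub-program matches the full program on every probe."""
    full = frozenset(components)
    return all(
        abs(run(components, active, x) - run(components, full, x)) <= tol
        for x in probes
    )


def greedy_prune(components: Dict[str, Component],
                 probes: Sequence[str]) -> FrozenSet[str]:
    """Drop components one at a time, keeping each removal that preserves behavior."""
    active = set(components)
    for name in sorted(components):
        trial = frozenset(active - {name})
        if is_sufficient(components, trial, probes):
            active = set(trial)
    return frozenset(active)


# Hypothetical toy program: one useful component and two that never contribute.
COMPONENTS: Dict[str, Component] = {
    "count_a": lambda x: float(x.count("a")),
    "dead_head_1": lambda x: 0.0,
    "dead_head_2": lambda x: 0.0,
}
PROBES = ["", "a", "banana", "xyz", "aaaa"]

kept = greedy_prune(COMPONENTS, PROBES)
print(sorted(kept))
```

A single greedy pass is a heuristic. It does not guarantee the smallest sufficient sub-program, for example when two components cancel each other out and can only be removed together.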
In practice
- Extract interpretable algorithms from Transformers.
- Understand Transformer expressive capacity.
- Analyze Transformer generalization abilities.
Topics
- Transformers
- RASP programming language
- Model interpretability
- Algorithmic tasks
- Length generalization
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.