Discovering Interpretable Algorithms by Decompiling Transformers to RASP

2026-02-09 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A new method extracts interpretable RASP programs from trained Transformers, addressing the open question of whether these models internally implement simple, understandable algorithms. The work (arXiv:2602.08857, accepted for ICML 2026) builds on two prior findings: Transformer computations can be simulated in the RASP family of programming languages, and length-generalization has been linked to simple RASP programs. The technique first re-parameterizes a Transformer faithfully as a RASP program, then applies causal interventions to discover a small sufficient sub-program. In experiments on small Transformers trained on algorithmic and formal language tasks, the method frequently recovered simple, interpretable RASP programs from models that length-generalize, which provides direct evidence that these models implement such programs internally. The paper is 104 pages long and includes 92 figures.

Key takeaway

For AI scientists and machine learning engineers working on model interpretability, this research offers a direct way to check whether a length-generalizing Transformer implements a simple, interpretable algorithm. Consider applying RASP-based decompilation with causal interventions to inspect a model's internal mechanism, which can support debugging and build trust. The reported evidence of algorithmic implementation comes from small Transformers trained on algorithmic and formal language tasks.

Key insights

A method decompiles Transformers into RASP programs, revealing their internal interpretable algorithms.

Principles

Transformers' computations can be simulated in RASP.
Length-generalization correlates with simple RASP programs.
Causal interventions discover small sufficient sub-programs.
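
To make the first two principles concrete, here is a minimal Python sketch of select and aggregate primitives in the style of the original RASP language, used to write sequence reversal. It is an illustration only: the function names, the boolean-matrix selector, and the averaging rule are assumptions of this sketch, not the RASP variant or code used in the paper.

```python
from typing import Callable, List, Sequence

Selector = List[List[bool]]


def select(keys: Sequence[int], queries: Sequence[int],
           predicate: Callable[[int, int], bool]) -> Selector:
    # sel[q][k] is True when query position q attends to key position k.
    return [[predicate(k, q) for k in keys] for q in queries]


def aggregate(selector: Selector, values: Sequence[float]) -> List[float]:
    # Average the selected values at each query position (0.0 if none selected).
    out = []
    for row in selector:
        picked = [v for v, on in zip(values, row) if on]
        out.append(sum(picked) / len(picked) if picked else 0.0)
    return out


def reverse(tokens: Sequence[float]) -> List[float]:
    idx = list(range(len(tokens)))
    # Recover the sequence length inside the program: attend everywhere and
    # average an indicator for position 0, which gives 1/n at every position.
    full = select(idx, idx, lambda k, q: True)
    frac = aggregate(full, [1.0 if i == 0 else 0.0 for i in idx])
    length = [round(1 / f) for f in frac]
    # Each position attends only to its mirror position and copies its value.
    target = [n - i - 1 for n, i in zip(length, idx)]
    flip = select(idx, target, lambda k, q: k == q)
    return aggregate(flip, tokens)


print(reverse([3, 1, 4]))
```

Nothing in the program depends on a fixed input length, which is the intuition behind connecting short RASP programs to length-generalization.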

Method

The method re-parameterizes a Transformer as a RASP program, then applies causal interventions to discover a small, sufficient sub-program, often recovering simple and interpretable algorithms.
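
The pruning step can be pictured with a toy example. The sketch below treats a "program" as a set of named components whose outputs are summed, ablates one component at a time by replacing its output with zero, and keeps the ablation whenever the pruned program still agrees with the full one on every input. The components, the zero-ablation baseline, the greedy search order, and all names are hypothetical choices made for illustration; the summary does not specify the paper's actual intervention procedure.

```python
from typing import Callable, Dict, List, Sequence, Set

Component = Callable[[Sequence[str]], float]

# Hypothetical components of a toy "does 'a' outnumber 'b'?" classifier.
COMPONENTS: Dict[str, Component] = {
    "majority": lambda x: float(x.count("a") - x.count("b")),
    "first_token_bias": lambda x: 0.25 if x[0] == "a" else -0.25,
    "dead": lambda x: 0.0,
}

INPUTS: List[List[str]] = [
    ["a", "a", "b"],
    ["b", "b", "a"],
    ["a", "b", "b", "b"],
    ["b", "a", "a", "a", "a"],
]


def run(components: Dict[str, Component], active: Set[str],
        x: Sequence[str]) -> float:
    # Ablated components contribute a constant 0.0 (zero-ablation).
    return sum(f(x) if name in active else 0.0
               for name, f in components.items())


def agreement(components: Dict[str, Component], active: Set[str],
              inputs: Sequence[Sequence[str]]) -> float:
    # Fraction of inputs where the pruned program matches the full program.
    full = set(components)
    same = sum((run(components, full, x) > 0) == (run(components, active, x) > 0)
               for x in inputs)
    return same / len(inputs)


def prune(components: Dict[str, Component],
          inputs: Sequence[Sequence[str]],
          threshold: float = 1.0) -> Set[str]:
    # Greedy pass: drop a component if behavior is preserved without it.
    active = set(components)
    for name in sorted(components):
        trial = active - {name}
        if agreement(components, trial, inputs) >= threshold:
            active = trial
    return active


print(prune(COMPONENTS, INPUTS))
```

A single greedy pass like this returns a sufficient subset, but not necessarily the smallest one, and the result can depend on the order in which components are tried.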

In practice

Extract interpretable algorithms from Transformers.
Understand Transformer expressive capacity.
Analyze Transformer generalization abilities.

Topics

Transformers
RASP programming language
Model interpretability
Algorithmic tasks
Length generalization

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.