Cross-Architecture Model Diffing with Crosscoders: Unsupervised Discovery of Differences Between LLMs
Summary
A new study presents the first application of crosscoders to cross-architecture model diffing. Model diffing compares the internal representations of large language models (LLMs) to identify behavioral differences between them. The authors, Thomas Jiralerspong and Trenton Bricken, also introduce Dedicated Feature Crosscoders (DFCs), an architectural modification designed to better isolate features unique to a single model. Using this unsupervised method, they identified features for Chinese Communist Party alignment in Qwen3-8B and DeepSeek-R1-0528-Qwen3-8B, American exceptionalism in Llama3.1-8B-Instruct, and a copyright refusal mechanism in GPT-OSS-20B. The work, published on February 12, 2026, aims to establish cross-architecture crosscoder model diffing as an effective way to uncover meaningful behavioral differences between models.
Key takeaway
For research scientists evaluating the safety and alignment of new LLM releases, this work demonstrates a practical method for unsupervised discovery of behavioral differences. Consider adding cross-architecture crosscoder model diffing, particularly with Dedicated Feature Crosscoders (DFCs), to your evaluation pipeline to surface subtle and potentially safety-critical features, even when the models being compared use different or novel architectures.
Key insights
Cross-architecture model diffing with DFCs can uncover specific behavioral alignments in LLMs without supervision.
Principles
- Comparing newly released LLMs often means comparing different architectures, so diffing has to work across architectures.
- Dedicated Feature Crosscoders (DFCs) improve feature isolation.
Method
The method applies crosscoders to LLMs with different architectures and uses Dedicated Feature Crosscoders (DFCs) to isolate features unique to one model, which enables unsupervised discovery of behavioral differences.
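To make the idea of feature isolation concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes that a crosscoder learns one shared feature dictionary with a separate encoder and decoder per model, and that a DFC reserves some features for a single model by zeroing their weights to and from the other model. The class name, method names, group sizes, and random weights are all hypothetical, and training is omitted.

```python
import random


def matvec(matrix, vec):
    """Multiply a nested-list matrix (rows x cols) by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Small random weights; a real crosscoder would learn these."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class DedicatedFeatureCrosscoder:
    """Toy crosscoder with a partitioned dictionary (assumed layout).

    Features are split into three groups: shared, dedicated to model A,
    and dedicated to model B. A dedicated feature neither reads from nor
    writes to the other model.
    """

    def __init__(self, dim_a, dim_b, n_shared, n_a_only, n_b_only, seed=0):
        rng = random.Random(seed)
        self.owner = ["shared"] * n_shared + ["a"] * n_a_only + ["b"] * n_b_only
        n = len(self.owner)
        # The two models may have different hidden sizes (cross-architecture).
        self.enc_a = random_matrix(n, dim_a, rng)  # features x dim_a
        self.enc_b = random_matrix(n, dim_b, rng)  # features x dim_b
        self.dec_a = random_matrix(dim_a, n, rng)  # dim_a x features
        self.dec_b = random_matrix(dim_b, n, rng)  # dim_b x features
        self.apply_masks()

    def apply_masks(self):
        """Zero the weights that connect a dedicated feature to the other model."""
        for i, owner in enumerate(self.owner):
            if owner == "a":
                self._detach(i, self.enc_b, self.dec_b)
            elif owner == "b":
                self._detach(i, self.enc_a, self.dec_a)

    @staticmethod
    def _detach(i, enc, dec):
        enc[i] = [0.0] * len(enc[i])  # feature i ignores this model's activations
        for row in dec:
            row[i] = 0.0  # feature i contributes nothing to this model's reconstruction

    def encode(self, act_a, act_b):
        """Sum both models' encoder outputs, then apply ReLU."""
        pre = [a + b for a, b in zip(matvec(self.enc_a, act_a), matvec(self.enc_b, act_b))]
        return [max(0.0, p) for p in pre]

    def decode(self, features):
        """Reconstruct each model's activations from the same feature vector."""
        return matvec(self.dec_a, features), matvec(self.dec_b, features)


dfc = DedicatedFeatureCrosscoder(dim_a=4, dim_b=6, n_shared=3, n_a_only=2, n_b_only=2)
features = dfc.encode([1.0] * 4, [0.5] * 6)
recon_a, recon_b = dfc.decode(features)
print(len(features), len(recon_a), len(recon_b))  # 7 4 6
```

In this sketch the partition is enforced by construction, so a feature in the "a" group cannot be explained by anything in model B. If the weights were trained, the masks would presumably need to be reapplied after every update; that detail is an assumption here, not something stated in the source.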
In practice
- Identify political alignments in LLMs.
- Detect copyright refusal mechanisms.
- Uncover model-specific behavioral traits.
Topics
- Cross-Architecture Model Diffing
- Crosscoders
- Dedicated Feature Crosscoders
- LLM Behavior Analysis
- Unsupervised Feature Discovery
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Researcher, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.