Cross-Architecture Model Diffing with Crosscoders: Unsupervised Discovery of Differences Between LLMs

2026-02-12 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, medium

Summary

A new study introduces the first application of crosscoders to cross-architecture model diffing. Model diffing compares the internal representations of large language models (LLMs) to identify behavioral differences between them. The authors, Thomas Jiralerspong and Trenton Bricken, also present Dedicated Feature Crosscoders (DFCs), an architectural modification designed to better isolate features unique to a single model. Using this unsupervised method, they identified features including Chinese Communist Party alignment in Qwen3-8B and DeepSeek-R1-0528-Qwen3-8B, American exceptionalism in Llama3.1-8B-Instruct, and a copyright refusal mechanism in GPT-OSS-20B. The work, published on February 12, 2026, aims to establish cross-architecture crosscoder model diffing as an effective method for uncovering meaningful behavioral distinctions between diverse AI models.

Key takeaway

For research scientists evaluating the safety and alignment of new LLM releases, this work demonstrates a practical method for unsupervised discovery of critical behavioral differences. You should consider integrating cross-architecture crosscoder model diffing, particularly with Dedicated Feature Crosscoders (DFCs), into your evaluation pipeline to proactively identify nuanced and potentially safety-critical features, even when comparing models with novel architectures.

Key insights

Cross-architecture model diffing with DFCs can uncover specific behavioral alignments in LLMs without supervision.

Principles

Cross-architecture diffing is needed to compare new LLM releases, which may use novel architectures.
Dedicated Feature Crosscoders (DFCs) improve the isolation of features unique to a single model.

Method

The method applies crosscoders to the internal representations of LLMs with different architectures. Dedicated Feature Crosscoders (DFCs) are used to isolate features unique to a single model, which enables unsupervised discovery of behavioral differences.
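
To make the idea concrete, here is a minimal sketch of a crosscoder forward pass and loss in plain Python. It is not the authors' implementation. The class name `ToyCrosscoder`, the latent layout (shared, then A-only, then B-only), the masking scheme used to dedicate latents to one model, and the loss coefficient are all assumptions for illustration; the summary does not describe how DFCs are built internally. The objective follows the commonly used crosscoder form: reconstruction error summed over both models plus a sparsity penalty weighted by decoder norms.

```python
import math
import random


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix, standing in for trained weights."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class ToyCrosscoder:
    """One shared sparse latent space, one encoder/decoder pair per model."""

    def __init__(self, d_a, d_b, n_shared, n_only_a, n_only_b, seed=0):
        rng = random.Random(seed)
        n = n_shared + n_only_a + n_only_b
        self.n_latents = n
        # d_a and d_b may differ because the two models have different widths.
        self.enc_a = rand_matrix(n, d_a, rng)
        self.enc_b = rand_matrix(n, d_b, rng)
        self.dec_a = rand_matrix(d_a, n, rng)
        self.dec_b = rand_matrix(d_b, n, rng)
        # ASSUMPTION: latents are laid out as [shared | A-only | B-only].
        self.only_a = list(range(n_shared, n_shared + n_only_a))
        self.only_b = list(range(n_shared + n_only_a, n))
        self._apply_masks()

    def _apply_masks(self):
        """Cut dedicated latents off from the other model (hypothetical scheme)."""
        for j in self.only_a:
            self.enc_b[j] = [0.0] * len(self.enc_b[j])
            for row in self.dec_b:
                row[j] = 0.0
        for j in self.only_b:
            self.enc_a[j] = [0.0] * len(self.enc_a[j])
            for row in self.dec_a:
                row[j] = 0.0

    def forward(self, act_a, act_b):
        """Encode both models' activations jointly, then decode per model."""
        pre = [a + b for a, b in zip(matvec(self.enc_a, act_a),
                                     matvec(self.enc_b, act_b))]
        latents = [max(0.0, p) for p in pre]  # ReLU
        return latents, matvec(self.dec_a, latents), matvec(self.dec_b, latents)

    def loss(self, act_a, act_b, l1_coeff=1e-3):
        """Squared reconstruction error plus decoder-norm-weighted sparsity."""
        latents, rec_a, rec_b = self.forward(act_a, act_b)
        mse = sum((x - r) ** 2 for x, r in zip(act_a, rec_a))
        mse += sum((x - r) ** 2 for x, r in zip(act_b, rec_b))
        penalty = 0.0
        for j, f in enumerate(latents):
            norm_a = math.sqrt(sum(row[j] ** 2 for row in self.dec_a))
            norm_b = math.sqrt(sum(row[j] ** 2 for row in self.dec_b))
            penalty += f * (norm_a + norm_b)
        return mse + l1_coeff * penalty


# Toy activations for the same token from two models of different width.
cc = ToyCrosscoder(d_a=4, d_b=6, n_shared=3, n_only_a=2, n_only_b=2)
act_a = [1.0, -0.5, 0.3, 0.8]
act_b = [0.2, 0.1, -0.4, 0.9, 0.0, -0.7]
latents, rec_a, rec_b = cc.forward(act_a, act_b)
print(len(latents), len(rec_a), len(rec_b))  # prints: 7 4 6
```

The masks here are applied once at construction. In a real training loop they would have to be re-applied (or the corresponding gradients zeroed) after every update so that dedicated latents stay dedicated.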

In practice

Identify political alignments in LLMs.
Detect copyright refusal mechanisms.
Uncover model-specific behavioral traits.
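
As a rough illustration of how model-specific features might be shortlisted once a crosscoder is trained, the sketch below computes a relative decoder norm per latent, a statistic commonly used in crosscoder model diffing. The function names, the `margin` threshold, and the toy decoder vectors are hypothetical and not taken from the paper. Candidates flagged this way would still need manual inspection before being labeled as, say, a political alignment or a refusal mechanism.

```python
import math


def l2_norm(vec):
    """Euclidean length of a vector."""
    return math.sqrt(sum(v * v for v in vec))


def relative_decoder_norm(dec_a_vec, dec_b_vec):
    """Return ||d_B|| / (||d_A|| + ||d_B||), or None for a dead latent.

    Values near 0 suggest the latent only writes to model A,
    values near 1 suggest it only writes to model B.
    """
    norm_a, norm_b = l2_norm(dec_a_vec), l2_norm(dec_b_vec)
    total = norm_a + norm_b
    if total == 0.0:
        return None
    return norm_b / total


def classify_latent(dec_a_vec, dec_b_vec, margin=0.1):
    """Bucket a latent as A-only, B-only, shared, or dead (margin is assumed)."""
    ratio = relative_decoder_norm(dec_a_vec, dec_b_vec)
    if ratio is None:
        return "dead"
    if ratio <= margin:
        return "model_a_only"
    if ratio >= 1.0 - margin:
        return "model_b_only"
    return "shared"


# Toy decoder vectors: model A has width 2, model B has width 3.
toy_latents = {
    "latent_0": ([0.9, 0.4], [0.0, 0.0, 0.0]),
    "latent_1": ([0.5, 0.5], [0.4, 0.4, 0.3]),
    "latent_2": ([0.0, 0.01], [0.7, 0.2, 0.6]),
}
labels = {name: classify_latent(a, b) for name, (a, b) in toy_latents.items()}
for name, label in labels.items():
    print(name, label)
```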

Topics

Cross-Architecture Model Diffing
Crosscoders
Dedicated Feature Crosscoders
LLM Behavior Analysis
Unsupervised Feature Discovery

Code references

yeyimilk/CrowdVLM-R1

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.