SwordBench: Evaluating Orthogonality of Steering Image Representations

2026-05-19 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, extended

Summary

SwordBench is a new benchmark designed to evaluate the steering of image representations in vision models, addressing limitations of existing language model-centric evaluation protocols. It focuses on assessing concept removal across various backbones like CLIP, SigLIP, and DINOv2, utilizing four real-world bias-removal datasets (CelebA, ISIC, Waterbirds, Counteranimal) and synthetic concept-infused tasks (ImageNet-W, ImageNet-C). The benchmark introduces novel evaluation metrics: "cross-concept robustness" to measure concept detection stability when orthogonalizing against other concepts, and "collateral damage" to quantify unintended performance degradation on downstream tasks for inputs lacking the bias. Initial findings indicate that while linear SVMs show strong separability and orthogonality, they often incur non-zero collateral damage, sometimes underperforming sparse autoencoders. Furthermore, standard classification metrics like AUC and F1 are insufficient proxies for concept entanglement, robustness, or collateral damage.

Key takeaway

Research scientists developing or deploying vision models with steering capabilities should integrate SwordBench's two new metrics, "cross-concept robustness" and "collateral damage," into their evaluation protocols. Relying solely on traditional metrics like AUC or F1 can mask second-order effects, such as unintended performance degradation or concept entanglement, that matter for AI safety and interpretability in high-stakes applications. Evaluation should extend beyond concept separability to the full downstream impact of a steering intervention.

Key insights

SwordBench evaluates image representation steering, introducing metrics for cross-concept robustness and collateral damage.

Principles

Steering image representations is crucial for AI interpretability and safety.
Standard classification metrics are insufficient for evaluating steering fidelity.
The linear representation hypothesis underpins concept activation vectors (CAVs).
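
To make the linear representation hypothesis concrete, here is a minimal sketch of one simple way to obtain a CAV: the normalized difference between the mean embedding of concept-present and concept-absent inputs. This is an illustrative stand-in, not the benchmark's procedure (the summary above mentions linear SVMs and sparse autoencoders). The function name `mean_difference_cav`, the synthetic embeddings, and the planted direction are all assumptions made for the example.

```python
import numpy as np

def mean_difference_cav(embeddings, has_concept):
    """Return a unit-norm direction pointing from the concept-absent mean
    to the concept-present mean (a hypothetical, very simple CAV estimator)."""
    embeddings = np.asarray(embeddings, dtype=float)
    has_concept = np.asarray(has_concept, dtype=bool)
    direction = (embeddings[has_concept].mean(axis=0)
                 - embeddings[~has_concept].mean(axis=0))
    norm = np.linalg.norm(direction)
    if norm == 0.0:
        raise ValueError("class means coincide; no direction to recover")
    return direction / norm

# Synthetic stand-in for backbone embeddings: Gaussian noise plus a concept
# planted along the first coordinate axis (assumed data, not from the paper).
rng = np.random.default_rng(0)
n, dim = 400, 16
has_concept = np.arange(n) % 2 == 0
true_direction = np.zeros(dim)
true_direction[0] = 1.0
embeddings = rng.normal(size=(n, dim)) + 3.0 * has_concept[:, None] * true_direction

cav = mean_difference_cav(embeddings, has_concept)
scores = embeddings @ cav  # concept score = projection onto the CAV
```

If the concept really is encoded linearly, the recovered `cav` aligns closely with the planted direction and the projection `scores` separate the two groups.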

Method

SwordBench evaluates CAVs by orthogonalizing representations to remove concepts, then measuring downstream performance shifts, cross-concept robustness, and collateral damage on diverse vision models and datasets.
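
The method described above can be sketched roughly as follows, under stated assumptions. Concept removal is modeled as projecting each embedding onto the orthogonal complement of a unit CAV, and collateral damage is taken here to be the accuracy drop of a fixed linear downstream classifier on inputs that lack the bias. The exact metric definitions used by SwordBench are not given in this summary, so `orthogonalize`, `collateral_damage`, the linear classifier, and the synthetic data are all hypothetical.

```python
import numpy as np

def orthogonalize(embeddings, cav):
    """Remove the component of each embedding along the (normalized) CAV."""
    cav = cav / np.linalg.norm(cav)
    return embeddings - np.outer(embeddings @ cav, cav)

def accuracy(embeddings, labels, weights):
    """Accuracy of a fixed linear classifier: predict True when x . w > 0."""
    return float(np.mean((embeddings @ weights > 0) == labels))

def collateral_damage(embeddings, task_labels, has_bias, task_weights, cav):
    """Assumed definition: downstream accuracy lost on bias-free inputs
    after the concept direction has been projected out."""
    clean = ~has_bias
    before = accuracy(embeddings[clean], task_labels[clean], task_weights)
    after = accuracy(orthogonalize(embeddings[clean], cav),
                     task_labels[clean], task_weights)
    return before - after

# Synthetic data (assumed): the downstream task reads coordinate 1.
rng = np.random.default_rng(0)
n, dim = 1000, 8
x = rng.normal(size=(n, dim))
has_bias = np.arange(n) % 2 == 0
task_weights = np.zeros(dim)
task_weights[1] = 1.0
task_labels = x[:, 1] > 0

# A CAV orthogonal to the task direction versus one entangled with it.
clean_cav = np.zeros(dim)
clean_cav[0] = 1.0
entangled_cav = np.zeros(dim)
entangled_cav[:2] = 1.0 / np.sqrt(2.0)

damage_clean = collateral_damage(x, task_labels, has_bias, task_weights, clean_cav)
damage_entangled = collateral_damage(x, task_labels, has_bias, task_weights, entangled_cav)
```

In this toy setup, removing a direction orthogonal to the task leaves bias-free accuracy untouched, while removing an entangled direction costs accuracy even though the inputs never contained the bias. That is the kind of second-order effect that separability metrics alone would not reveal.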

In practice

Use SwordBench to evaluate vision model steering methods.
Prioritize methods with low collateral damage for real-world applications.
Consider SigLIP for high-level semantic concepts, DINOv2 for structural features.
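
When comparing steering methods in practice, one check in the spirit of cross-concept robustness is to measure how detection of concept A changes after representations are orthogonalized against the CAV of a different concept B. The sketch below assumes that reading of the metric and reports the change in AUC. The names `cross_concept_robustness` and `remove_direction`, the rank-based `auc` helper, and the synthetic data are illustrative assumptions, not SwordBench's implementation.

```python
import numpy as np

def auc(scores, labels):
    """Probability that a random positive outscores a random negative
    (ties count half); equivalent to ROC AUC."""
    diff = scores[labels][:, None] - scores[~labels][None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def remove_direction(embeddings, direction):
    """Project embeddings onto the orthogonal complement of a direction."""
    direction = direction / np.linalg.norm(direction)
    return embeddings - np.outer(embeddings @ direction, direction)

def cross_concept_robustness(embeddings, labels_a, cav_a, cav_b):
    """Assumed definition: change in concept-A detection AUC after
    orthogonalizing against concept B (0 means fully robust)."""
    before = auc(embeddings @ cav_a, labels_a)
    after = auc(remove_direction(embeddings, cav_b) @ cav_a, labels_a)
    return after - before

# Synthetic embeddings (assumed): concept A planted along coordinate 0.
rng = np.random.default_rng(0)
n, dim = 400, 8
labels_a = np.arange(n) % 2 == 0
cav_a = np.zeros(dim)
cav_a[0] = 1.0
x = rng.normal(size=(n, dim)) + 3.0 * labels_a[:, None] * cav_a

# Concept B as an independent direction, then as one overlapping with A.
independent_b = np.zeros(dim)
independent_b[1] = 1.0
entangled_b = np.zeros(dim)
entangled_b[0], entangled_b[1] = 1.0, 0.2

change_independent = cross_concept_robustness(x, labels_a, cav_a, independent_b)
change_entangled = cross_concept_robustness(x, labels_a, cav_a, entangled_b)
```

A change near zero suggests the two concept directions are effectively orthogonal. A large negative change indicates entanglement: removing B also erases much of the evidence for A.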

Topics

Steering Image Representations
Concept Activation Vectors
SwordBench Benchmark
Cross-Concept Robustness
Collateral Damage

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.