SwordBench: Evaluating Orthogonality of Steering Image Representations
Summary
SwordBench is a new benchmark for evaluating the steering of image representations in vision models, addressing the limitations of existing evaluation protocols, which are centered on language models. It assesses concept removal across backbones such as CLIP, SigLIP, and DINOv2, using four real-world bias-removal datasets (CelebA, ISIC, Waterbirds, CounterAnimal) and synthetic concept-infused tasks (ImageNet-W, ImageNet-C). The benchmark introduces two evaluation metrics: "cross-concept robustness," which measures how stable concept detection remains when representations are orthogonalized against other concepts, and "collateral damage," which quantifies unintended performance degradation on downstream tasks for inputs that lack the bias. Initial findings indicate that linear SVMs show strong separability and orthogonality but often incur non-zero collateral damage, sometimes underperforming sparse autoencoders. Standard classification metrics such as AUC and F1 are also insufficient proxies for concept entanglement, robustness, or collateral damage.
Key takeaway
If you develop or deploy vision models with steering capabilities, integrate SwordBench's two metrics, "cross-concept robustness" and "collateral damage," into your evaluation protocols. Relying solely on traditional metrics such as AUC or F1 can mask second-order effects, including unintended performance degradation and concept entanglement, that matter for AI safety and interpretability in high-stakes applications. Evaluate the full impact of a steering method, not only how well it separates concepts.
Key insights
SwordBench evaluates image representation steering, introducing metrics for cross-concept robustness and collateral damage.
Principles
- Steering image representations is crucial for AI interpretability and safety.
- Standard classification metrics are insufficient for evaluating steering fidelity.
- The linear representation hypothesis underpins concept activation vectors (CAVs).
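Under the linear representation hypothesis, a concept corresponds to a direction in embedding space. As a minimal sketch of that idea, the snippet below estimates a CAV as the normalized difference between the mean embedding of examples with the concept and the mean embedding of examples without it. This mean-difference estimator is an assumption chosen for brevity; it is not the linear SVM or sparse autoencoder approach the summary mentions, and the function names and toy embeddings are hypothetical.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def mean_difference_cav(with_concept, without_concept):
    """Unit vector pointing from the 'without' class mean to the 'with' class mean."""
    mu_pos = mean_vector(with_concept)
    mu_neg = mean_vector(without_concept)
    diff = [p - q for p, q in zip(mu_pos, mu_neg)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("class means coincide; no direction to extract")
    return [d / norm for d in diff]


# Toy 2-D embeddings (hypothetical): the concept only moves the first coordinate.
with_concept = [[2.0, 1.0], [4.0, 1.0]]
without_concept = [[0.0, 1.0], [0.0, 1.0]]
cav = mean_difference_cav(with_concept, without_concept)
```

In this toy setup the recovered direction lies along the first axis, which is the only coordinate that differs between the two groups.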
Method
SwordBench evaluates CAVs by orthogonalizing representations to remove concepts, then measuring downstream performance shifts, cross-concept robustness, and collateral damage on diverse vision models and datasets.
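The orthogonalization step can be illustrated with a standard vector projection: subtract from a representation its component along the concept direction. The sketch below is an assumption about what such a step might look like, not SwordBench's implementation; the helper names, the toy vectors, and the second "other concept" direction are hypothetical. It also shows why cross-concept effects arise: when two concept directions are not orthogonal, removing one changes the score along the other.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def remove_concept(x, cav):
    """Project x onto the subspace orthogonal to the concept direction cav."""
    denom = dot(cav, cav)
    if denom == 0.0:
        raise ValueError("concept direction must be non-zero")
    scale = dot(x, cav) / denom
    return [xi - scale * ci for xi, ci in zip(x, cav)]


# Toy representation and concept direction (hypothetical values).
x = [3.0, 4.0]
cav = [1.0, 0.0]
steered = remove_concept(x, cav)

# A second, non-orthogonal concept direction (hypothetical).
other = [0.6, 0.8]
score_before = dot(x, other)
score_after = dot(steered, other)
shift = score_before - score_after  # non-zero because the directions overlap
```

After the projection, the steered vector has no component along `cav`. The `shift` value is only an illustrative proxy for concept entanglement; the benchmark's own definition of cross-concept robustness may differ.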
In practice
- Use SwordBench to evaluate vision model steering methods.
- Prioritize methods with low collateral damage for real-world applications.
- Consider SigLIP for high-level semantic concepts, DINOv2 for structural features.
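To make "low collateral damage" concrete, one rough way to estimate it is to compare downstream accuracy before and after steering, restricted to inputs that do not contain the bias. The sketch below assumes this accuracy-drop formulation, which follows the summary's description but may not match the benchmark's exact metric; the function names and toy labels are hypothetical.

```python
def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    if not labels:
        raise ValueError("need at least one example")
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def collateral_damage(labels, preds_before, preds_after, has_bias):
    """Accuracy drop caused by steering, measured only on bias-free inputs."""
    keep = [i for i, biased in enumerate(has_bias) if not biased]
    kept_labels = [labels[i] for i in keep]
    acc_before = accuracy([preds_before[i] for i in keep], kept_labels)
    acc_after = accuracy([preds_after[i] for i in keep], kept_labels)
    return acc_before - acc_after


# Toy downstream task (hypothetical): five inputs, two of which carry the bias.
labels = [1, 0, 1, 0, 1]
preds_before = [1, 0, 1, 0, 0]  # predictions from unsteered representations
preds_after = [1, 1, 1, 0, 0]   # predictions after concept removal
has_bias = [False, False, True, False, True]

damage = collateral_damage(labels, preds_before, preds_after, has_bias)
```

A positive value means steering hurt predictions on inputs that never contained the bias; a value near zero is the desired outcome.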
Topics
- Steering Image Representations
- Concept Activation Vectors
- SwordBench Benchmark
- Cross-Concept Robustness
- Collateral Damage
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.