SwordBench: Evaluating Orthogonality of Steering Image Representations

Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, extended

Summary

SwordBench is a benchmark for evaluating the steering of image representations in vision models, addressing the limitations of existing evaluation protocols, which are centered on language models. It assesses concept removal across backbones such as CLIP, SigLIP, and DINOv2, using four real-world bias-removal datasets (CelebA, ISIC, Waterbirds, CounterAnimal) and synthetic concept-infused tasks (ImageNet-W, ImageNet-C). The benchmark introduces two evaluation metrics: "cross-concept robustness," which measures how stable concept detection remains when representations are orthogonalized against other concepts, and "collateral damage," which quantifies unintended performance degradation on downstream tasks for inputs that lack the bias. Initial findings indicate that linear SVMs show strong separability and orthogonality but often incur non-zero collateral damage, sometimes underperforming sparse autoencoders. Standard classification metrics such as AUC and F1 also prove to be insufficient proxies for concept entanglement, robustness, or collateral damage.
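As a rough illustration of the two metrics described above, the sketch below computes an accuracy drop restricted to bias-free inputs and the change in detection AUC for one concept after another concept's direction is projected out. The function names, the use of accuracy and AUC, and the toy data are all assumptions made for illustration; the summary does not give the paper's exact formulas, so this should be read as one plausible interpretation rather than SwordBench's definitions.

```python
import numpy as np

def collateral_damage(y_true, pred_before, pred_after, has_bias):
    """Accuracy drop caused by steering, measured only on inputs without the bias.

    Hypothetical definition: accuracy before steering minus accuracy after.
    """
    clean = ~np.asarray(has_bias, dtype=bool)
    y = np.asarray(y_true)[clean]
    acc_before = np.mean(np.asarray(pred_before)[clean] == y)
    acc_after = np.mean(np.asarray(pred_after)[clean] == y)
    return float(acc_before - acc_after)

def auc(scores, labels):
    """Rank-based AUC: probability that a positive outscores a negative (ties count half)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    greater = np.mean(pos[:, None] > neg[None, :])
    ties = np.mean(pos[:, None] == neg[None, :])
    return float(greater + 0.5 * ties)

def cross_concept_auc_drop(reps, labels_a, dir_a, dir_b):
    """Drop in concept-A detection AUC after projecting out the concept-B direction.

    Hypothetical definition: AUC of the linear score reps @ dir_a, before minus after.
    """
    reps = np.asarray(reps, dtype=float)
    b = np.asarray(dir_b, dtype=float)
    b = b / np.linalg.norm(b)
    steered = reps - np.outer(reps @ b, b)  # remove the concept-B component
    return auc(reps @ dir_a, labels_a) - auc(steered @ dir_a, labels_a)

# Toy collateral-damage example: one bias-free input flips from correct to wrong.
damage = collateral_damage(
    y_true=[1, 0, 1, 0],
    pred_before=[1, 0, 1, 0],
    pred_after=[1, 0, 0, 0],
    has_bias=[True, True, False, False],
)

# Toy robustness example: concept A lives on axis 0, concept B on axis 1.
rng = np.random.default_rng(0)
reps = rng.normal(size=(200, 4))
labels_a = reps[:, 0] > 0
e0, e1 = np.eye(4)[0], np.eye(4)[1]
drop_orthogonal = cross_concept_auc_drop(reps, labels_a, e0, e1)  # B orthogonal to A
drop_entangled = cross_concept_auc_drop(reps, labels_a, e0, e0)   # B identical to A
```

In this toy setup, removing an orthogonal concept leaves detection of concept A untouched, while removing a fully entangled direction collapses detection to chance. That contrast is the kind of effect a cross-concept metric is meant to expose and that a single AUC or F1 figure would not show.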

Key takeaway

Research scientists developing or deploying steerable vision models should add SwordBench's two metrics, "cross-concept robustness" and "collateral damage," to their evaluation protocols. Relying solely on traditional metrics such as AUC or F1 can mask second-order effects, including unintended performance degradation and concept entanglement, that matter for AI safety and interpretability in high-stakes applications. Evaluation should extend beyond concept separability to the full impact of steering.

Key insights

SwordBench evaluates image representation steering, introducing metrics for cross-concept robustness and collateral damage.

Method

SwordBench evaluates concept activation vectors (CAVs) by orthogonalizing representations to remove concepts, then measuring downstream performance shifts, cross-concept robustness, and collateral damage across diverse vision models and datasets.
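The orthogonalization step can be sketched as a linear projection: each representation has its component along the concept direction subtracted, leaving a vector with zero projection onto that direction. This is a minimal sketch assuming a single unit-normalized concept vector; `remove_concept` and the random data are hypothetical, and the benchmark's actual steering procedure may differ.

```python
import numpy as np

def remove_concept(reps, direction):
    """Project each row of `reps` onto the subspace orthogonal to `direction`."""
    reps = np.asarray(reps, dtype=float)
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)  # unit-normalize the concept direction
    # Subtract each row's component along d: x - (x . d) d
    return reps - np.outer(reps @ d, d)

# Hypothetical data: 5 representations of dimension 8 and a random concept vector.
rng = np.random.default_rng(0)
reps = rng.normal(size=(5, 8))
cav = rng.normal(size=8)

steered = remove_concept(reps, cav)
# Remaining component along the concept direction (numerically zero).
residual = steered @ (cav / np.linalg.norm(cav))
```

A downstream classifier would then be run on both `reps` and `steered`, and the resulting shift in its performance compared.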

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.