UNBOX: Unveiling Black-box visual models with Natural-language
Summary
UNBOX is a framework for class-wise model dissection of black-box visual recognition systems, operating under the strict constraint that only output probabilities are accessible. It removes the need for architecture, parameters, gradients, or training data by reformulating activation maximization as a semantic search problem. The method uses Large Language Models (LLMs) and text-to-image diffusion models to iteratively refine natural-language descriptors that maximally activate a given class. UNBOX was evaluated on ImageNet-1K, Waterbirds, and CelebA, where it performed competitively with state-of-the-art white-box interpretability methods on semantic fidelity, recovery of latent training semantics, and bias discovery. It uncovers concepts the model has implicitly learned, reflects the model's training distribution, and identifies potential sources of bias, supporting more trustworthy and accountable visual recognition systems.
Key takeaway
For research scientists developing or deploying proprietary visual recognition models, UNBOX offers a critical tool for auditing model behavior and biases without requiring internal access. You can use its data-free, gradient-free approach to uncover latent training semantics and systematic spurious correlations, which is essential for ensuring fairness and robustness in open-world settings. This capability allows you to perform meaningful model dissection and debiasing even when only inference APIs are available, enhancing accountability and trustworthiness.
Key insights
UNBOX enables black-box visual model interpretability by using LLMs and diffusion models for semantic search.
Principles
- Interpretability is possible with only output probabilities.
- Semantic search can approximate gradient-based optimization.
- Contextual memory stabilizes linguistic optimization.
Method
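UNBOX uses LLM agents and a text-to-image diffusion model to iteratively refine natural-language prompts. Each refinement is guided by the trend and intensity of the classifier's output probabilities (whether the score is rising or falling, and how high it is), and the search is stabilized by global and local optimization contexts.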
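The loop described above can be illustrated with a small, self-contained sketch. It is not the UNBOX implementation: every name here (`render`, `class_probability`, `propose`, `semantic_search`) is hypothetical, the LLM, diffusion model, and black-box classifier are replaced by toy stand-ins, and a simple hill-climbing rule stands in for the paper's agent-driven refinement. It only shows the general shape of a gradient-free search that sees nothing but output probabilities.

```python
import random

VOCAB = ["bird", "water", "lake", "forest", "bamboo", "beak", "sky", "rock"]
MAX_WORDS = 4

# Toy stand-in for what the black box has learned (in percent); hidden from the search.
HIDDEN_WEIGHTS = {"bird": 40, "water": 30, "lake": 20, "beak": 10}


def render(descriptor):
    """Stand-in for a text-to-image model: the 'image' is just the set of words."""
    return frozenset(descriptor.split())


def class_probability(image):
    """Stand-in for the black-box API: returns only the target-class probability."""
    return sum(w for word, w in HIDDEN_WEIGHTS.items() if word in image) / 100


def propose(descriptor, rng):
    """Stand-in for the LLM agent: add one word, or swap one when the descriptor is full."""
    words = descriptor.split()
    new_word = rng.choice([w for w in VOCAB if w not in words])
    if len(words) < MAX_WORDS:
        words.append(new_word)
    else:
        words[rng.randrange(len(words))] = new_word
    return " ".join(words)


def semantic_search(seed_descriptor, steps=200, seed=0):
    rng = random.Random(seed)
    start = (seed_descriptor, class_probability(render(seed_descriptor)))
    best = start      # global context: best descriptor seen so far
    current = start   # local context: descriptor currently being refined
    history = [start]
    for _ in range(steps):
        candidate = propose(current[0], rng)
        score = class_probability(render(candidate))  # intensity
        trend = score - current[1]                    # rising or falling
        history.append((candidate, score))
        if score > best[1]:
            best = (candidate, score)
        # Keep refining on a non-decreasing trend; otherwise restart from the global best.
        current = (candidate, score) if trend >= 0 else best
    return best, history


if __name__ == "__main__":
    best, history = semantic_search("forest sky")
    print("best descriptor:", best[0], "probability:", best[1])
```

In a real audit, `render` would call a text-to-image model, `class_probability` would query the inference API under test, and `propose` would prompt an LLM with the recent history; the acceptance rule shown here is only one simple way to combine trend and intensity.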
In practice
- Audit proprietary vision APIs for bias.
- Understand model reasoning without internal access.
- Generate debiasing pseudo-labels from descriptors.
Topics
- Black-box Explainability
- Semantic Optimization
- Large Language Models
- Text-to-Image Diffusion
- Bias Discovery
Best for: Research Scientist, AI Scientist, AI Security Engineer, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.