UNBOX: Unveiling Black-box visual models with Natural-language

2026-04-16 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Explainable AI · Depth: Expert, extended

Summary

UNBOX is a framework for class-wise dissection of black-box visual recognition systems when only output probabilities are accessible. It requires no access to architecture, parameters, gradients, or training data; instead, it reformulates activation maximization as a semantic search problem. Large Language Models (LLMs) and text-to-image diffusion models iteratively refine natural-language descriptors that maximally activate a given class. Evaluated on ImageNet-1K, Waterbirds, and CelebA, UNBOX performs competitively with state-of-the-art white-box interpretability methods on semantic fidelity, recovery of latent training semantics, and bias discovery. It uncovers concepts the model has learned implicitly, reflects the model's training distribution, and identifies potential sources of bias, supporting more trustworthy and accountable visual recognition systems.

Key takeaway

For research scientists developing or deploying proprietary visual recognition models, UNBOX offers a critical tool for auditing model behavior and biases without requiring internal access. You can use its data-free, gradient-free approach to uncover latent training semantics and systematic spurious correlations, which is essential for ensuring fairness and robustness in open-world settings. This capability allows you to perform meaningful model dissection and debiasing even when only inference APIs are available, enhancing accountability and trustworthiness.

Key insights

UNBOX enables black-box visual model interpretability by using LLMs and diffusion models for semantic search.

Principles

Interpretability is possible with only output probabilities.
Semantic search can approximate gradient-based optimization.
Contextual memory stabilizes linguistic optimization.

Method

UNBOX uses LLM agents and a text-to-image diffusion model to iteratively refine natural-language prompts. Each refinement is guided by the trend and intensity of the classifier's output probabilities, and the search is stabilized by global and local optimization contexts.
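
The loop described above can be illustrated with a minimal sketch. This is not the paper's implementation: the LLM agent, the diffusion model, and the classifier API are replaced by toy stand-ins (toy_propose, toy_score, and VOCAB are hypothetical names), and a simple greedy accept-if-better rule takes the place of the paper's context-stabilized optimization. It only shows the shape of a gradient-free search in which the sole signal is a class probability.

```python
import random
from typing import Callable, List, Tuple

Descriptor = Tuple[str, ...]

# Hypothetical vocabulary for the stand-in proposer (not from the paper).
VOCAB = ["water", "reeds", "forest", "bamboo", "beach", "sky", "snow", "lake"]


def toy_score(descriptor: Descriptor) -> float:
    """Stand-in for the black-box pipeline: render images from the descriptor,
    query the classifier, and return the target-class probability.
    A hidden word preference plays the role of the model's learned concept."""
    hidden = {"water": 0.4, "lake": 0.3, "reeds": 0.2}
    return min(1.0, 0.05 + sum(hidden.get(w, 0.0) for w in sorted(set(descriptor))))


def toy_propose(best: Descriptor, feedback: List[dict], rng: random.Random) -> Descriptor:
    """Stand-in for the LLM agent: swap one word of the best descriptor.
    A real agent would read `feedback`; this toy version ignores it."""
    words = list(best)
    words[rng.randrange(len(words))] = rng.choice(VOCAB)
    return tuple(words)


def semantic_search(
    score: Callable[[Descriptor], float],
    propose: Callable[[Descriptor, List[dict], random.Random], Descriptor],
    init: Descriptor,
    steps: int = 200,
    seed: int = 0,
) -> Tuple[Descriptor, float, List[dict]]:
    """Greedy, gradient-free search over descriptors using only scores."""
    rng = random.Random(seed)
    best, best_score = init, score(init)
    prev_score = best_score
    feedback: List[dict] = []
    for _ in range(steps):
        candidate = propose(best, feedback, rng)
        s = score(candidate)
        # "intensity" is the probability itself; "trend" is its change
        # relative to the previous query.
        feedback.append({"descriptor": candidate, "intensity": s, "trend": s - prev_score})
        prev_score = s
        if s > best_score:  # keep only strict improvements
            best, best_score = candidate, s
    return best, best_score, feedback


best, best_score, feedback = semantic_search(toy_score, toy_propose, ("sky", "snow", "beach"))
print(best, round(best_score, 2))
```

In a real setting, toy_score would wrap an image generator plus the inference API under audit, and toy_propose would prompt an LLM with the accumulated feedback. How UNBOX structures its global and local contexts is not modeled here.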

In practice

Audit proprietary vision APIs for bias.
Understand model reasoning without internal access.
Generate debiasing pseudo-labels from descriptors.
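
As a rough illustration of the auditing use case, the sketch below compares the target-class probability an API returns for images generated under different context descriptors. The function name context_gap and the probability values are hypothetical placeholders, not results from the paper; a large gap between contexts would only be a hint of a spurious correlation that deserves closer inspection.

```python
from statistics import mean
from typing import Dict, List


def context_gap(probs_by_context: Dict[str, List[float]]) -> Dict[str, float]:
    """Mean target-class probability per context minus the overall mean."""
    overall = mean(p for ps in probs_by_context.values() for p in ps)
    return {ctx: mean(ps) - overall for ctx, ps in probs_by_context.items()}


# Placeholder probabilities for a "waterbird" class, as if returned by an
# inference API for images generated with two background descriptors.
placeholder_probs = {
    "on a lake": [0.90, 0.80, 0.85],
    "in a forest": [0.30, 0.40, 0.35],
}

gaps = context_gap(placeholder_probs)
for context, gap in gaps.items():
    print(f"{context}: {gap:+.2f}")
```

Contexts whose gap is strongly positive or negative could then serve as candidate group labels for a group-robust training method, under the assumption that the descriptors capture the spurious attribute.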

Topics

Black-box Explainability
Semantic Optimization
Large Language Models
Text-to-Image Diffusion
Bias Discovery

Code references

kohpangwei/group_DRO

Best for: Research Scientist, AI Scientist, AI Security Engineer, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.