A Structured Benchmark for Text-Guided Anomaly Detection: When Language Stops Conditioning the Decision

Source: Artificial Intelligence · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Computer Vision & Pattern Recognition · Depth: Expert, quick

Summary

A new benchmark, Text-Guided Anomaly Detection (TGAD), rigorously evaluates multimodal vision-language models for industrial anomaly detection. Existing evaluation protocols rarely measure whether language genuinely conditions the decision, so they cannot tell whether performance gains come from text guidance or from strong visual features. TGAD comprises three scenarios: a prompt-sensitivity test on MVTec AD; a component-tagged MVTec AD in which the model must restrict its assessment to the named component; and the novel Assembled Panel Dataset (APD), which requires realistic defect-type and component-location knowledge. Evaluations across generative, training-free discriminative, and embedding-adaptive discriminative models show that the textual interface conditions decisions only superficially. For instance, a generative model's I-AUROC dropped from 97.4 to 82.6 when the object noun was removed from the prompt. Component instructions failed to constrain decisions effectively, with performance falling from 90.3 to 66.3. On APD, image-level discrimination collapsed to 71.2, 50.5, and 31.5, in some cases below chance. These findings suggest that current benchmarks overstate text-guided abilities and that structured protocols are needed for reliable industrial deployment.
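To make the "below chance" remark concrete, here is a minimal sketch of an image-level AUROC computation. It assumes that I-AUROC in the summary means the standard image-level area under the ROC curve, reported on a 0 to 100 scale; the function name `image_auroc` and the toy labels and scores are illustrative and not taken from the paper.

```python
from typing import Sequence


def image_auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Image-level AUROC via the pairwise (Mann-Whitney) formulation.

    labels: 1 for anomalous images, 0 for normal images.
    scores: one anomaly score per image (higher means more anomalous).
    Returns a value in [0, 1]; multiply by 100 for the scale used in the summary.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one anomalous and one normal image")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # anomalous image correctly ranked above normal image
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy data (illustrative only): two normal and two anomalous images.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

auroc = image_auroc(toy_labels, toy_scores)
# Negating the scores reverses the ranking, which mirrors the AUROC around 0.5.
inverted_auroc = image_auroc(toy_labels, [-s for s in toy_scores])

print(f"AUROC: {100 * auroc:.1f}, inverted ranking: {100 * inverted_auroc:.1f}")
```

Under this definition a scorer that ranks at random sits near 50, so a value such as 31.5 indicates a model that ranks normal images above anomalous ones more often than not.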

Key takeaway

If you are a machine learning engineer developing industrial anomaly detection systems, critically re-evaluate the performance of your current multimodal models. Your existing benchmarks likely overstate text-guided capabilities, because language often fails to genuinely condition decisions. Implement structured evaluation protocols such as TGAD to verify that language actually controls the decision, especially for component-level instructions and combined defect-type and component-location tasks. Doing so helps ensure that your deployed systems are reliably controllable through language and reduces the risk of unexpected failures in production.

Key insights

Current multimodal anomaly detection systems exhibit superficial text guidance, often relying more on visual features than language conditioning.

Principles

Method

The TGAD benchmark progressively increases language's functional role across three scenarios: prompt-sensitivity, component-tagged assessment, and combined defect-type/component-location knowledge.
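The first scenario, prompt sensitivity, can be sketched as a small ablation harness: score the same images under several prompt variants and compare the resulting metric. Everything below is a hypothetical illustration under stated assumptions. The prompt templates, `evaluate_prompt_variants`, `toy_score_fn`, and `accuracy_at_half` are invented for this sketch and do not reproduce the benchmark's actual prompts, models, or metric.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

# Hypothetical prompt templates; the benchmark's real wording is not given here.
PROMPT_VARIANTS: Dict[str, str] = {
    "full": "Is there any anomaly in this {object}?",
    "no_object_noun": "Is there any anomaly in this image?",
}

Sample = Tuple[Any, str, int]  # (image, object name, label: 1 = anomalous)


def evaluate_prompt_variants(
    samples: Sequence[Sample],
    score_fn: Callable[[Any, str], float],
    metric_fn: Callable[[List[int], List[float]], float],
    variants: Dict[str, str] = PROMPT_VARIANTS,
) -> Dict[str, float]:
    """Run the same samples under each prompt variant and return one metric each."""
    results: Dict[str, float] = {}
    for name, template in variants.items():
        labels: List[int] = []
        scores: List[float] = []
        for image, object_name, label in samples:
            # str.format ignores unused keyword arguments, so templates
            # without an {object} slot pass through unchanged.
            prompt = template.format(object=object_name)
            labels.append(label)
            scores.append(score_fn(image, prompt))
        results[name] = metric_fn(labels, scores)
    return results


def toy_score_fn(image: Dict[str, Any], prompt: str) -> float:
    """Stand-in for a real model: informative only if the prompt names the object."""
    if image["object"] in prompt:
        return image["defect_strength"]
    return 0.5  # uninformative score when the object noun is missing


def accuracy_at_half(labels: List[int], scores: List[float]) -> float:
    """Simple stand-in metric: accuracy when thresholding scores at 0.5."""
    correct = sum((s > 0.5) == bool(y) for y, s in zip(labels, scores))
    return correct / len(labels)


toy_samples: List[Sample] = [
    ({"object": "bottle", "defect_strength": 0.9}, "bottle", 1),
    ({"object": "bottle", "defect_strength": 0.1}, "bottle", 0),
]

results = evaluate_prompt_variants(toy_samples, toy_score_fn, accuracy_at_half)
for variant, value in results.items():
    print(f"{variant}: {value:.2f}")
```

In a real evaluation, `score_fn` would wrap the model under test and `metric_fn` would be the benchmark's own metric. Passing both in as callables keeps the harness independent of any particular model or library.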

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.