A Structured Benchmark for Text-Guided Anomaly Detection: When Language Stops Conditioning the Decision
Summary
Text-Guided Anomaly Detection (TGAD) is a new benchmark for rigorously evaluating multimodal vision-language models in industrial anomaly detection. Existing evaluation protocols often fail to measure whether language genuinely conditions decisions, which obscures whether performance gains come from text guidance or from strong visual features. TGAD comprises three scenarios: a prompt-sensitivity test on MVTec AD, a component-tagged version of MVTec AD that requires the model to restrict its assessment to a named component, and the novel Assembled Panel Dataset (APD), which demands realistic knowledge of both defect type and component location. Evaluations across generative, training-free discriminative, and embedding-adaptive discriminative models reveal that the textual interface conditions decisions only superficially. For instance, a generative model's I-AUROC dropped from 97.4 to 82.6 when the object noun was removed from the prompt. Component instructions failed to constrain decisions effectively, with the reported score falling from 90.3 to 66.3. On APD, image-level discrimination collapsed to 71.2, 50.5, and 31.5, in one case falling below chance. These findings suggest that current benchmarks overstate text-guided abilities and that structured protocols are needed before such systems can be deployed reliably in industry.
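For readers less familiar with the metric: I-AUROC is the image-level area under the ROC curve, which equals the probability that a randomly chosen anomalous image receives a higher anomaly score than a randomly chosen normal one. On the 0 to 100 scale used above, 50 corresponds to chance. The following is a minimal Python sketch on made-up scores, not the benchmark's evaluation code; `image_auroc` is a hypothetical helper name.

```python
def image_auroc(scores, labels):
    """Pairwise (Mann-Whitney) AUROC.

    Returns the fraction of (anomalous, normal) image pairs in which the
    anomalous image has the higher score; ties count as half.
    labels: 1 = anomalous, 0 = normal.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one anomalous and one normal image")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Illustrative, made-up anomaly scores for four images.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auroc = image_auroc(scores, labels)
# Negating every score reverses the ranking, so the result is 1 - auroc.
inverted = image_auroc([-s for s in scores], labels)

print(f"I-AUROC: {100 * auroc:.1f}")
print(f"I-AUROC with inverted scores: {100 * inverted:.1f}")
```

Because inverting the scores turns an AUROC of a into 1 - a, a below-chance result such as 31.5 means the model's ranking is systematically reversed: the same scores with their sign flipped would reach 68.5.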
Key takeaway
If you are a machine learning engineer developing industrial anomaly detection systems, critically re-evaluate the performance of your current multimodal models. Existing benchmarks likely overstate text-guided capabilities, because language often fails to genuinely condition decisions. Adopt structured evaluation protocols such as TGAD to verify that language actually controls the output, especially for component-level instructions and for tasks that combine defect type with component location. Doing so helps ensure that deployed systems are reliably controllable through language and reduces the risk of unexpected failures in production.
Key insights
In current multimodal anomaly detection systems, text guidance is superficial: decisions often rely more on visual features than on language conditioning.
Principles
- Standard benchmarks overstate text-guided capabilities.
- Language conditioning requires specific evaluation protocols.
- Removing the object noun from the prompt sharply degrades generative model performance (I-AUROC from 97.4 to 82.6).
Method
The TGAD benchmark progressively increases language's functional role across three scenarios: prompt-sensitivity, component-tagged assessment, and combined defect-type/component-location knowledge.
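To make the component-tagged scenario concrete, the sketch below shows one plausible reading of "assessment restriction": the ground-truth label depends on the component named in the instruction, so a defect on a different component should not trigger a positive decision. This is an assumption about the protocol, not a description of the benchmark's actual data format; `TaggedSample`, `expected_label`, and the component names are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class TaggedSample:
    """One image plus the set of components that carry a defect (hypothetical schema)."""
    image_id: str
    defective_components: FrozenSet[str]


def expected_label(sample: TaggedSample, target_component: str) -> int:
    """Return 1 only if the component named in the instruction is defective."""
    return int(target_component in sample.defective_components)


# Made-up example: the defect is on the cap, not on the body.
sample = TaggedSample(image_id="img_001", defective_components=frozenset({"cap"}))

label_cap = expected_label(sample, "cap")    # instruction: inspect the cap
label_body = expected_label(sample, "body")  # instruction: inspect the body

print(label_cap, label_body)
```

Under this reading, a model that ignores the instruction and flags any visible defect would be right for the first instruction and wrong for the second, which is the kind of failure a component-tagged evaluation is designed to expose.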
In practice
- Test prompt sensitivity by removing key nouns.
- Evaluate component-level instruction adherence.
- Use datasets requiring both defect and location knowledge.
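The first check above can be sketched as a small ablation harness. This is a minimal Python illustration under stated assumptions: the summary does not say how the object noun was removed, so `ablate_noun` simply deletes the word; `evaluate` stands in for whatever function runs your model over a test set with a given prompt and returns I-AUROC; and the stub and its numbers are made up for demonstration.

```python
import re
from typing import Callable, Dict


def ablate_noun(prompt: str, noun: str) -> str:
    """Delete the object noun (whole word, case-insensitive) and tidy whitespace."""
    stripped = re.sub(rf"\b{re.escape(noun)}\b", "", prompt, flags=re.IGNORECASE)
    return " ".join(stripped.split())


def prompt_sensitivity(
    evaluate: Callable[[str], float], prompt: str, noun: str
) -> Dict[str, float]:
    """Evaluate with the full and the ablated prompt and report the drop."""
    full = evaluate(prompt)
    ablated = evaluate(ablate_noun(prompt, noun))
    return {"full": full, "ablated": ablated, "drop": full - ablated}


def fake_evaluate(prompt: str) -> float:
    """Stub standing in for a real model evaluation; the values are made up."""
    return 90.0 if "bottle" in prompt.lower() else 75.0


prompt = "Is there any anomaly in this bottle image?"
ablated_prompt = ablate_noun(prompt, "bottle")
report = prompt_sensitivity(fake_evaluate, prompt, "bottle")

print(ablated_prompt)
print(report)
```

A large drop shows that the model's decisions depend heavily on the exact prompt wording, while a near-zero drop suggests the noun contributes little. Neither outcome alone shows that language is conditioning decisions in the intended way, which is why the component-level and defect-plus-location checks are listed alongside it.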
Topics
- Text-Guided Anomaly Detection
- Multimodal Vision-Language Models
- Industrial Anomaly Detection
- MVTec AD
- Assembled Panel Dataset
- Benchmark Evaluation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.