OneFocus: Enabling Real-World X-ray Security Screening with a Unified Vision-Language Model
Summary
OneFocus is a unified Vision-Language Model (VLM) for real-world X-ray security screening that targets both contraband detection and broader visual understanding. The work addresses two problems: the limitations of conventional detectors and the scarcity of high-quality X-ray image-caption data for training VLMs. Its main data contribution is MMXray, a curated benchmark of 52,124 image-caption pairs covering 28 fine-grained contraband classes. To simulate realistic occlusion, the authors also developed CleanDET, a synthesis dataset with 28 contraband categories, and AnyContraSyn, a controllable synthesis method. An extensible data curation pipeline, OnePipe, supports this effort. Built on MMXray, OneFocus supports four core tasks: visual question answering, contraband localization, classification, and image understanding. It is reported to achieve leading performance and robust cross-domain generalization in X-ray contraband understanding.
Key takeaway
For security screening professionals evaluating next-generation contraband detection, OneFocus shows one way to work around the shortage of captioned X-ray data: combine a large curated benchmark with synthetic occlusion data. Consider unified vision-language models, especially those trained on extensive, curated X-ray datasets like MMXray, when adaptability to new threats matters. The reported results are leading performance and strong cross-domain generalization.
Key insights
OneFocus is a unified VLM that combines a new large-scale X-ray image-caption dataset with occlusion synthesis methods to reach leading contraband detection performance.
Principles
- Data scarcity for VLMs in X-ray security is a critical gap.
- Synthetic data can supplement real scans with realistic occlusion patterns.
- Unified VLMs can perform multiple vision-language tasks.
Method
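OneFocus is built on MMXray, a benchmark of 52,124 image-caption pairs. CleanDET and AnyContraSyn supply synthetic occlusion data, and the OnePipe pipeline handles data curation. The resulting data is used to train a single VLM for four core tasks: visual question answering, contraband localization, classification, and image understanding.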
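To make the "one dataset, four tasks" idea concrete, here is a minimal sketch of how a single annotated X-ray record could be expanded into one training sample per task. This summary does not describe the actual MMXray schema or the OneFocus prompt formats, so every name here (XrayRecord, Box, to_task_samples, the prompt strings) is a hypothetical placeholder.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Task names follow the four core tasks listed in the summary.
TASKS = ("vqa", "localization", "classification", "understanding")


@dataclass
class Box:
    """One annotated contraband item (hypothetical schema)."""
    label: str
    xyxy: Tuple[int, int, int, int]  # left, top, right, bottom in pixels


@dataclass
class XrayRecord:
    """One image-caption pair with optional box annotations (hypothetical)."""
    image_path: str
    caption: str
    boxes: List[Box] = field(default_factory=list)


def to_task_samples(rec: XrayRecord) -> List[dict]:
    """Expand one record into one (prompt, target) sample per task."""
    labels = sorted({b.label for b in rec.boxes})
    located = [{"label": b.label, "box": list(b.xyxy)} for b in rec.boxes]
    # Prompt wording is illustrative only, not taken from the paper.
    specs = [
        ("vqa", "Does this scan contain contraband?", "yes" if labels else "no"),
        ("localization", "Locate every contraband item.", located),
        ("classification", "List the contraband classes present.", labels),
        ("understanding", "Describe this X-ray image.", rec.caption),
    ]
    return [
        {"task": task, "image": rec.image_path, "prompt": prompt, "target": target}
        for task, prompt, target in specs
    ]
```

The point of the sketch is that a unified model can share one image encoder and one text interface across tasks, with only the prompt and target changing per sample.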
In practice
- Curate large-scale, fine-grained image-caption datasets for VLMs.
- Synthesize occlusion patterns to enhance model robustness.
- Develop unified VLMs for multi-task security screening.
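The occlusion synthesis step can be illustrated with a simple, generic technique. The sketch below is not AnyContraSyn, whose details this summary does not give. It assumes images are stored as transmittance values in [0, 1] and relies on the Beer-Lambert idea that the transmittances of stacked materials multiply, so overlaying an item darkens the background instead of replacing it. The function name composite and the list-of-lists image type are illustrative choices; a real pipeline would likely use an array library.

```python
from typing import List, Tuple

# Transmittance image: 1.0 means nothing in the beam path, 0.0 means opaque.
Image = List[List[float]]


def composite(
    bag: Image, item: Image, top: int, left: int
) -> Tuple[Image, Tuple[int, int, int, int]]:
    """Overlay `item` on `bag` by multiplying transmittances.

    Returns the new image and the item's box as (left, top, right, bottom).
    Hypothetical helper for illustration, not the paper's method.
    """
    h, w = len(item), len(item[0])
    if top < 0 or left < 0 or top + h > len(bag) or left + w > len(bag[0]):
        raise ValueError("item must fit inside the bag image")
    out = [row[:] for row in bag]  # copy so the input scan is left unchanged
    for r in range(h):
        for c in range(w):
            # Stacked materials attenuate jointly: T_total = T_bag * T_item.
            out[top + r][left + c] = bag[top + r][left + c] * item[r][c]
    return out, (left, top, left + w, top + h)
```

Because the position is a parameter, the same item can be placed over cluttered or empty regions to vary occlusion difficulty, and the returned box gives a localization label for free.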
Topics
- X-ray Security Screening
- Vision-Language Models
- Contraband Detection
- MMXray Dataset
- Data Synthesis
- OneFocus Model
Best for: Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.