OneFocus: Enabling Real-World X-ray Security Screening with a Unified Vision-Language Model

· Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Data Science & Analytics · Depth: Expert, quick

Summary

OneFocus is a unified vision-language model (VLM) for real-world X-ray security screening, covering both contraband detection and broader visual understanding. It addresses two problems: the limitations of conventional detectors and the scarcity of high-quality X-ray image-caption data for training VLMs. The work contributes MMXray, a curated benchmark of 52,124 image-caption pairs across 28 fine-grained contraband classes. To simulate realistic occlusion, the authors developed CleanDET, a synthesis dataset covering 28 contraband categories, and AnyContraSyn, a controllable synthesis method. An extensible data curation pipeline, OnePipe, supports the dataset construction. Built on MMXray, OneFocus supports four core tasks: visual question answering, contraband localization, classification, and image understanding. The authors report leading performance and robust cross-domain generalization in X-ray contraband understanding.

Key takeaway

For security screening professionals evaluating next-generation contraband detection, OneFocus demonstrates one way to overcome the scarcity of X-ray training data. Consider evaluating unified vision-language models, especially those trained on large, curated X-ray datasets such as MMXray, for their adaptability to new threats. The reported results show leading performance and strong cross-domain generalization.

Key insights

OneFocus, a unified VLM, combines a new large-scale X-ray image-caption dataset with synthesis methods to reach leading performance in contraband detection.

Principles

Method

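OneFocus is built on MMXray, a benchmark of 52,124 image-caption pairs. CleanDET and AnyContraSyn provide synthetic data that simulates occlusion, and the OnePipe pipeline handles data curation. The result is a single VLM trained for four core tasks: visual question answering, contraband localization, classification, and image understanding.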
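
To make the idea of synthetic occlusion concrete, here is a minimal sketch of how an item can be overlaid on a baggage scan. It is not AnyContraSyn: the summary does not describe that method's internals. It assumes a generic, physics-inspired rule in which overlapping materials multiply their X-ray transmittance, and the function name `composite_xray` and the toy pixel values are hypothetical.

```python
from typing import List

# Transmittance image: 1.0 = nothing in the beam path, lower = denser material.
Image = List[List[float]]


def composite_xray(background: Image, item: Image, top: int, left: int) -> Image:
    """Overlay `item` onto `background` by multiplying transmittances.

    Illustrative assumption only: generic multiplicative blending,
    not the AnyContraSyn method described in the paper.
    """
    out = [row[:] for row in background]  # copy so the input is left untouched
    for dy, item_row in enumerate(item):
        for dx, t in enumerate(item_row):
            y, x = top + dy, left + dx
            # Skip item pixels that fall outside the background.
            if 0 <= y < len(out) and 0 <= x < len(out[y]):
                out[y][x] = out[y][x] * t
    return out


# Toy 3x4 scan with a partially dense region, plus a 1x2 "item" strip.
bag = [
    [1.0, 0.8, 0.8, 1.0],
    [1.0, 0.5, 0.5, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]
knife = [[0.4, 0.4]]

scene = composite_xray(bag, knife, top=1, left=1)
```

Where the item overlaps existing clutter, the composited pixels are darker than either source alone, which is the visual cue that makes occluded contraband hard to spot.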

Best for: Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.