From Vision to Text: A Compact Multimodal Approach for Robust, Cross-Domain Presentation Attack Detection on ID Cards

· Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, extended

Summary

This paper introduces a compact multimodal model for Presentation Attack Detection (PAD) on ID cards, targeting two recurring problems: cross-domain shift and data scarcity. The proposed SmolVLM2-based framework combines visual and textual information and is offered in two structures, one generative and one discriminative. The researchers evaluated it against a CNN baseline (DenseNet-121) and a unimodal baseline (SigLIP-SO400M) on genuine ID datasets from Chile and Mexico and on synthetic passports from Poland, Portugal, and Spain. Zero-shot multimodal models perform poorly, with an Equal Error Rate (EER) above 45% on Chile, but supervised fine-tuning dramatically improves results. The generative structure of the multimodal model was the most robust to cross-country domain shift on genuine data, achieving 5.99% EER on Mexico compared to DenseNet's 36.77%. All models, however, showed unstable and often catastrophic performance on the synthetic datasets: the Bona fide Presentation Classification Error Rate (BPCER) reached 100% on Spain, meaning every genuine document was rejected. This suggests that these synthetic data do not accurately reflect real-world PAD challenges.
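The EER and BPCER figures quoted above can be made concrete with a small sketch. It is not the authors' evaluation code: the function names are hypothetical, and it assumes each model emits one score per image where a higher score means "more likely an attack". APCER is the share of attacks accepted as genuine, BPCER is the share of genuine documents rejected as attacks, and the EER is read off at the threshold where the two are closest.

```python
def apcer_bpcer(bona_fide_scores, attack_scores, threshold):
    """Error rates at one threshold (assumed convention: score >= threshold means 'attack')."""
    missed_attacks = sum(1 for s in attack_scores if s < threshold)
    rejected_bona_fide = sum(1 for s in bona_fide_scores if s >= threshold)
    apcer = missed_attacks / len(attack_scores)
    bpcer = rejected_bona_fide / len(bona_fide_scores)
    return apcer, bpcer


def equal_error_rate(bona_fide_scores, attack_scores):
    """Return (eer, threshold) at the candidate threshold where APCER and BPCER are closest."""
    candidates = sorted(set(bona_fide_scores) | set(attack_scores))
    # Add one threshold above every score, where nothing is flagged as an attack.
    candidates.append(candidates[-1] + 1.0)
    best = None
    for t in candidates:
        apcer, bpcer = apcer_bpcer(bona_fide_scores, attack_scores, t)
        gap = abs(apcer - bpcer)
        if best is None or gap < best[0]:
            best = (gap, (apcer + bpcer) / 2, t)
    return best[1], best[2]


# Toy scores, invented purely for illustration (not data from the paper).
genuine = [0.1, 0.2, 0.3, 0.6]
attacks = [0.4, 0.7, 0.8, 0.9]
eer, threshold = equal_error_rate(genuine, attacks)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

On a finite test set the two error curves rarely cross exactly, so this sketch reports the average of APCER and BPCER at the closest point. A BPCER of 100%, as reported for the synthetic Spain set, corresponds to an operating threshold that sits below every genuine score.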

Key takeaway

Machine Learning Engineers building ID card Presentation Attack Detection systems for international deployment should prioritize multimodal generative models. These models, especially larger variants like SmolVLM2-2.2B, were more robust to cross-country domain shift on genuine ID data and achieved lower error rates than the unimodal and CNN baselines. Treat synthetic datasets with strong skepticism as an evaluation tool: in this study, results on them were unstable and did not reflect real-world PAD challenges. Focus effort on acquiring and using diverse, genuine ID card data for reliable system development.

Key insights

Multimodal generative models improve ID card PAD generalization across real countries, but synthetic datasets are unreliable for evaluation.

Method

The method adapts SmolVLM2, a compact vision-language model, into a binary PAD classifier using novel generative and discriminative structures. Fine-tuning applies Low-Rank Adaptation (LoRA) to the text decoder only, while the vision encoder remains frozen.
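To illustrate what LoRA does to a single decoder weight matrix, here is a minimal pure-Python sketch. It is an assumption-laden toy, not the paper's implementation: the class name `LoRALinear`, the rank, the scaling factor `alpha`, and the tiny 4x4 matrix are all hypothetical, and the summary does not state which values or target modules the authors used. The idea shown is the standard one: the pretrained weight W stays frozen, and only two small matrices A and B are trained, giving the output W x + (alpha / rank) * B (A x).

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Toy linear layer with a frozen weight and a trainable low-rank update (hypothetical helper)."""

    def __init__(self, frozen_weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(frozen_weight), len(frozen_weight[0])
        self.frozen_weight = frozen_weight  # W: pretrained, never updated
        self.scale = alpha / rank
        # A starts small and random, B starts at zero, so the update is zero before training.
        self.lora_a = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.lora_b = [[0.0] * rank for _ in range(out_dim)]

    def forward(self, x):
        base = matvec(self.frozen_weight, x)
        update = matvec(self.lora_b, matvec(self.lora_a, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_parameter_count(self):
        rank, in_dim = len(self.lora_a), len(self.lora_a[0])
        out_dim = len(self.lora_b)
        return rank * in_dim + out_dim * rank


# Illustrative 4x4 weight; real decoder matrices are far larger.
weight = [[1, 2, 3, 4], [0, 1, 0, 1], [2, 0, 2, 0], [1, 1, 1, 1]]
layer = LoRALinear(weight, rank=1, alpha=2.0)
print(layer.forward([1, 1, 1, 1]))        # identical to the frozen layer before training
print(layer.trainable_parameter_count())  # 8 trainable values versus 16 in the full matrix
```

Because B is initialized to zero, the adapted layer reproduces the pretrained behavior at the start of fine-tuning, and the number of trainable values grows with the rank rather than with the full matrix size. A frozen vision encoder, as described above, would simply have no such adapter attached and receive no gradient updates.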

Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.