Architectural Bias in Face Presentation Attack Detection: A Comparative Study of Vision Transformers and Convolutional Neural Networks

2026-06-16 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, quick

Summary

A comparative empirical study of Face Presentation Attack Detection (PAD) systems finds that Vision Transformer architectures show less demographic bias than convolutional neural networks. Experiments on the CASIA-SURF Cross-Ethnicity Face Anti-Spoofing (CeFA) dataset evaluated a Multimodal ViT-Tiny, a ResNet18 CNN baseline, and a pretrained DeiT-S. DeiT-S achieved the highest overall accuracy, 97.27% versus 90.15% for ResNet18, and the lowest Equal Error Rate (EER), 0.86%. DeiT-S also narrowed the gap in Average Classification Error Rate (ACER) between African and East Asian subjects to 0.13%, an 83% reduction from a reported 0.75%. On Central Asian subjects evaluated zero-shot, DeiT-S held its Bona Fide Presentation Classification Error Rate (BPCER) to 2.89% against ResNet18's 10.44%, a 3.6x generalization advantage. These findings suggest that architectural design, particularly the use of pretrained Vision Transformers, influences cross-demographic fairness in PAD systems.
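The two derived figures in the summary (the 83% gap reduction and the 3.6x advantage) follow directly from the quoted percentages. The short check below is only an arithmetic sketch: the variable and function names are illustrative, and it assumes the 3.6x figure is the ratio of the two BPCER values, which the numbers support but the summary does not state explicitly.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional drop from a baseline value to an improved value."""
    return (baseline - improved) / baseline


# Percentages quoted in the summary above.
acer_gap_reported = 0.75   # reported inter-ethnic ACER gap
acer_gap_deit_s = 0.13     # DeiT-S inter-ethnic ACER gap
bpcer_resnet18 = 10.44     # ResNet18 BPCER on zero-shot Central Asian subjects
bpcer_deit_s = 2.89        # DeiT-S BPCER on the same subjects

gap_reduction = relative_reduction(acer_gap_reported, acer_gap_deit_s)
bpcer_ratio = bpcer_resnet18 / bpcer_deit_s  # assumed meaning of "3.6x"

print(f"ACER gap reduction: {gap_reduction:.1%}")  # 82.7%, quoted as 83%
print(f"BPCER ratio: {bpcer_ratio:.2f}x")          # 3.61x, quoted as 3.6x
```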

Key takeaway

AI security engineers designing or deploying face Presentation Attack Detection systems should prioritize pretrained Vision Transformer architectures. In this study, the pretrained DeiT-S delivered the highest accuracy (97.27%) and cut the inter-ethnic ACER gap by 83%. It also generalized about 3.6x better to an unseen demographic group (2.89% versus 10.44% BPCER), which supports both the security and the equity of biometric authentication.

Key insights

Pretrained Vision Transformers significantly reduce demographic bias and improve generalization in face Presentation Attack Detection.

Principles

Pretrained Vision Transformers achieve superior PAD accuracy.
They produce smaller demographic performance gaps.
They generalize more equitably across unseen groups.

Method

The study compares three architectures on the CASIA-SURF CeFA dataset: a Multimodal ViT-Tiny, a ResNet18 CNN baseline, and a pretrained DeiT-S.

In practice

Consider pretrained Vision Transformers for PAD systems.
Evaluate PAD systems for inter-ethnic ACER gaps.
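One way to act on the second recommendation is to compute the PAD error rates separately for each demographic group and report the spread. The sketch below is a minimal illustration, not the study's evaluation code. It assumes binary labels (1 = attack, 0 = bona fide), pools all attack types when computing APCER (ISO/IEC 30107-3 reports APCER per attack instrument species), and defines the gap as the largest minus the smallest per-group ACER. The function names and toy data are hypothetical.

```python
from collections import defaultdict

ATTACK, BONA_FIDE = 1, 0  # label convention assumed for this sketch


def pad_errors_by_group(labels, preds, groups):
    """Return {group: {"APCER", "BPCER", "ACER"}} as fractions in [0, 1]."""
    # Per group: [attacks, attacks accepted, bona fide, bona fide rejected]
    tallies = defaultdict(lambda: [0, 0, 0, 0])
    for y, p, g in zip(labels, preds, groups):
        t = tallies[g]
        if y == ATTACK:
            t[0] += 1
            t[1] += int(p == BONA_FIDE)  # attack classified as bona fide
        else:
            t[2] += 1
            t[3] += int(p == ATTACK)     # bona fide classified as attack
    results = {}
    for g, (n_att, accepted, n_bf, rejected) in tallies.items():
        if n_att == 0 or n_bf == 0:
            raise ValueError(f"group {g!r} needs both attack and bona fide samples")
        apcer = accepted / n_att
        bpcer = rejected / n_bf
        results[g] = {"APCER": apcer, "BPCER": bpcer, "ACER": (apcer + bpcer) / 2}
    return results


def acer_gap(results):
    """Largest minus smallest per-group ACER."""
    acers = [r["ACER"] for r in results.values()]
    return max(acers) - min(acers)


# Toy data, invented for illustration only.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0]
preds = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1]
groups = ["A"] * 8 + ["B"] * 4

per_group = pad_errors_by_group(labels, preds, groups)
print(per_group)
print("ACER gap:", acer_gap(per_group))
```

Reporting the per-group table alongside the single gap number helps show whether a disparity comes from missed attacks (APCER) or from rejected genuine users (BPCER), which have different operational costs.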

Topics

Face Presentation Attack Detection
Vision Transformers
Demographic Bias
Biometric Security
DeiT-S
ResNet18

Best for: Research Scientist, AI Scientist, Computer Vision Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.