Venus-DeFakerOne: Unified Fake Image Detection & Localization

Source: cs.CV updates on arXiv.org · Field: Technology & Digital: Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, extended

Summary

DeFakerOne is a data-centric foundation model for unified fake image detection and localization (FIDL). It targets a mismatch in current research: image forgery techniques are converging, yet detection research remains fragmented. Released on May 15, 2026, DeFakerOne combines the InternVL2 and SAM2 architectures to perform image-level detection and pixel-level localization simultaneously across AIGC, DeepFake, document, and natural-image manipulations. The model was trained on a curated dataset of 12.5 million samples spanning multiple forensic domains, with a closed-loop data generation pipeline for continuous adaptation. In the reported experiments, DeFakerOne outperforms baselines on 39 forgery detection benchmarks and 9 localization benchmarks. It is also reported to be more robust to real-world perturbations and to advanced generators such as GPT-Image-2, reaching 95.77% accuracy on GPT-Image-2-Bench.

Key takeaway

For computer vision engineers and research scientists building robust anti-forgery systems, DeFakerOne's unified approach shows that increasing data volume alone is not enough. Prioritize balanced, operation-aware data composition and multi-granularity supervision, especially pixel-level masks for fine-grained manipulations. Make sure your visual backbone preserves high-resolution local evidence: the stronger compression in newer VLMs can dilute critical forensic artifacts and reduce detection and localization accuracy against advanced generative models such as GPT-Image-2.
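One way to act on the "balanced, operation-aware data composition" advice is inverse-frequency sampling over (domain, operation) groups. The sketch below is illustrative only: this summary does not describe the paper's actual data recipe, and the `Sample` layout, the group labels, and the `balanced_weights` helper are all hypothetical names introduced here.

```python
import random
from collections import Counter
from typing import List, Tuple

# Hypothetical sample record: (sample_id, forensic domain, manipulation operation).
Sample = Tuple[str, str, str]


def balanced_weights(samples: List[Sample]) -> List[float]:
    """Return per-sample weights so every (domain, operation) group
    carries the same total sampling probability.

    Assumption: inverse-frequency weighting, a common balancing
    heuristic, not the method reported for DeFakerOne.
    """
    group_sizes = Counter((domain, op) for _, domain, op in samples)
    n_groups = len(group_sizes)
    return [1.0 / (n_groups * group_sizes[(domain, op)]) for _, domain, op in samples]


# Toy, deliberately imbalanced corpus: three AIGC samples, one document splice.
corpus: List[Sample] = [
    ("a1", "aigc", "full_synthesis"),
    ("a2", "aigc", "full_synthesis"),
    ("a3", "aigc", "full_synthesis"),
    ("d1", "document", "splicing"),
]

weights = balanced_weights(corpus)

# Draw a training batch; the rare group is now sampled as often as the common one.
batch = random.Random(0).choices(corpus, weights=weights, k=8)
```

With these weights each group contributes equal probability mass regardless of its raw size, so a small document-forgery subset is not drowned out by a much larger AIGC subset.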

Key insights

Unified fake image detection and localization requires balanced multi-domain data and fine-grained supervision.

Principles

Method

DeFakerOne cascades an MLLM-based perception-and-detection module (InternVL2) with a SAM2-based segmentation module. It uses dynamic VQA templates for detection and generates segmentation tokens for pixel-level localization.
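The control flow of such a cascade can be sketched with stand-in components. Everything below is an assumption made for illustration: the template wording, the `[SEG]` token name, the answer format, and the toy `detector` and `segmenter` callables are hypothetical and do not reflect the real InternVL2 or SAM2 APIs. In particular, the sketch treats the segmentation token as a simple trigger for the second stage, whereas a real system would more likely pass a learned token representation to the mask decoder.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional

Image = List[List[int]]  # toy grayscale image: rows of pixel values
Mask = List[List[int]]   # 1 marks a pixel flagged as manipulated

# Hypothetical prompt pool standing in for "dynamic VQA templates".
VQA_TEMPLATES = [
    "Is this image real or fake? If fake, segment the manipulated region.",
    "Has this image been forged or generated? Localize any tampered area.",
    "Decide whether the image is authentic and mark edited pixels if not.",
]

SEG_TOKEN = "[SEG]"  # hypothetical name for the segmentation token


@dataclass
class FIDLResult:
    is_fake: bool
    mask: Optional[Mask]  # None when no localization was requested


def build_prompt(rng: random.Random) -> str:
    """Pick one template per query so the prompt wording varies."""
    return rng.choice(VQA_TEMPLATES)


def run_fidl(
    image: Image,
    detector: Callable[[Image, str], str],
    segmenter: Callable[[Image], Mask],
    rng: random.Random,
) -> FIDLResult:
    """Stage 1: image-level verdict. Stage 2: mask, only if the token appears."""
    answer = detector(image, build_prompt(rng))
    is_fake = answer.lower().startswith("fake")
    if is_fake and SEG_TOKEN in answer:
        return FIDLResult(True, segmenter(image))
    return FIDLResult(is_fake, None)


def toy_detector(image: Image, prompt: str) -> str:
    """Stand-in for the MLLM: flags any saturated pixel as tampering."""
    tampered = any(px == 255 for row in image for px in row)
    return f"Fake. {SEG_TOKEN}" if tampered else "Real."


def toy_segmenter(image: Image) -> Mask:
    """Stand-in for the segmentation module: masks saturated pixels."""
    return [[1 if px == 255 else 0 for px in row] for row in image]


result = run_fidl([[0, 255], [0, 0]], toy_detector, toy_segmenter, random.Random(0))
print(result)
```

Keeping the two stages behind plain callables makes the cascade easy to test in isolation: the detector decides whether localization is needed, and the segmenter only runs on images judged fake.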

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.