DRIFT: From Robustness Gaps to Invariance Manifolds for AI-Generated Image Detection

2026-06-08 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision & Image Processing · Depth: Expert, extended

Summary

DRIFT is an AI-generated image detection framework from Samsung Research Institute that addresses the limitations of existing training-free methods by learning a structured invariance manifold of real images. It uses a frozen DINOv2 ViT-B/14 backbone with two lightweight projection heads, one robust and one fragile, to decompose the representation space. The robust subspace suppresses variation caused by physically plausible transformations, while the fragile subspace stays sensitive to edit-like perturbations. An ordering margin of γ=0.3 enforces hierarchical separation between the two, so detection reduces to a margin-violation test. An EMA teacher (momentum 0.996) and a reconstruction anchor (weight 0.1) stabilize training on real-only data. Experiments show strong open-world generalization: a mean ACC/AP of approximately 97.8/99.8 on ForenSynth, consistently high accuracy on Diffusion-6cls, and, on PromptWorld-1K, 93.2% ACC / 92.0% AP on Gemini and 94.8% ACC / 95.0% AP on ChatGPT. The framework also produces interpretable patch-wise localization heatmaps.

Key takeaway

Machine learning engineers building robust AI-generated image detectors should consider a structured invariance learning approach. By explicitly modeling the real-image manifold with robust and fragile representation subspaces, it generalizes better in open-world settings than fixed robustness-gap techniques. Use an EMA teacher and a reconstruction anchor to stabilize training on real-only datasets, and use patch-wise drift maps for both detection and interpretable localization of synthetic content.

Key insights

AI-generated image detection improves by learning a structured invariance manifold of real images using robust and fragile representation subspaces.

Principles

Decompose representation space into robust and fragile subspaces.
Enforce hierarchical ordering between physical invariance and edit variability.
Stabilize one-class invariance learning with EMA and reconstruction.
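
The EMA stabilization named in the last principle can be illustrated with the standard exponential-moving-average update. This is a minimal sketch rather than the authors' code: the momentum of 0.996 comes from the summary, while the function name ema_update and the flat name-to-float weight dictionaries are assumptions made for illustration.

```python
def ema_update(teacher, student, momentum=0.996):
    """Return teacher weights moved a small step toward the student weights.

    Standard EMA rule: new_teacher = m * teacher + (1 - m) * student.
    Weights are modeled as a flat dict of floats (an illustrative assumption).
    """
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


# Toy weights: after one step the teacher has moved 0.4% of the way to the student.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
teacher = ema_update(teacher, student)
```

With a momentum this close to 1, the teacher changes slowly, which is the usual reason an EMA copy serves as a stable target while the student is still being trained.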

Method

Train projection heads on a frozen vision foundation model (VFM) using real-only data, enforcing robust invariance, fragile sensitivity, and an ordering margin, with an EMA teacher and a reconstruction loss for stability. Detect fakes via margin violation.
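
One plausible reading of the ordering margin is a hinge that asks the fragile-subspace drift to exceed the robust-subspace drift by at least γ. The sketch below assumes that form; the summary gives γ=0.3 but not the exact loss, so the hinge, the scalar drift inputs, and the names ordering_hinge and violates_margin are hypothetical.

```python
GAMMA = 0.3  # ordering margin reported in the summary


def ordering_hinge(robust_drift, fragile_drift, gamma=GAMMA):
    """Assumed hinge form: zero when fragile_drift >= robust_drift + gamma.

    robust_drift: how far the robust embedding moves (expected to be small).
    fragile_drift: how far the fragile embedding moves (expected to be large).
    """
    return max(0.0, gamma + robust_drift - fragile_drift)


def violates_margin(robust_drift, fragile_drift, gamma=GAMMA):
    """Margin-violation test: True when the assumed ordering does not hold."""
    return ordering_hinge(robust_drift, fragile_drift, gamma) > 0.0


# Ordering respected: the fragile drift clears the robust drift by more than gamma.
ok_case = violates_margin(robust_drift=0.05, fragile_drift=0.60)

# Ordering broken: the gap between the two drifts is smaller than gamma.
bad_case = violates_margin(robust_drift=0.40, fragile_drift=0.50)
```

Under this assumption the same quantity does double duty: it is a penalty during training and a decision statistic at test time.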

In practice

Use patch-wise drift maps for interpretable AI-generated image localization.
Aggregate patch scores with Top-k median for robust global detection.
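
The Top-k median aggregation above can be sketched as follows, assuming it means "take the k highest patch scores and return their median". The function name topk_median, the choice of k, and the toy scores are illustrative assumptions; the summary does not state the value of k.

```python
import heapq
import statistics


def topk_median(patch_scores, k):
    """Median of the k largest patch scores (assumed meaning of 'Top-k median')."""
    if not 1 <= k <= len(patch_scores):
        raise ValueError("k must be between 1 and the number of patches")
    return statistics.median(heapq.nlargest(k, patch_scores))


# Toy drift scores for 8 patches; three of them stand out.
scores = [0.1, 0.2, 0.1, 0.9, 0.8, 0.7, 0.1, 0.2]
global_score = topk_median(scores, k=3)  # median of [0.9, 0.8, 0.7] is 0.8
```

Compared with taking the maximum, the median of the top k is less sensitive to a single outlier patch, and compared with a mean over all patches it is not diluted by the many low-scoring ones.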

Topics

AI-Generated Image Detection
Invariance Learning
Representation Learning
Vision Foundation Models
Digital Forensics
Image Authenticity

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.