Detecting Temporally Localized Manipulations in Authentic Video Streams

2026-06-08 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, extended

Summary

A new study addresses the challenge of detecting temporally localized manipulations in authentic video streams, where short manipulated segments are subtly inserted into otherwise genuine footage. It highlights the limitations of existing datasets such as LAV-DF, AV-Deepfake1M, and TVIL, which primarily focus on full video manipulation, face-centric deepfakes, or object removal. To fill this gap, the authors built a custom-curated dataset comprising 100 fully authentic videos, 100 generated manipulation segments, and 100 partially manipulated videos. Inserted fake segments span 125 to 262 frames within videos of 300 to more than 1,300 frames. The research evaluates two detection approaches: a linear probe on DINOv3 features and an unsupervised DINOv3 feature similarity method. The unsupervised method achieved 83.00% global precision and 95.00% video-level accuracy on authentic streams, well above the linear probe's maximum precision of 0.473 (47.3%).

Key takeaway

If you are a multimedia forensics analyst or AI security engineer evaluating video authenticity, your existing deepfake detection tools, often trained on fully manipulated videos or object removal, will likely miss subtle, temporally localized manipulations. Prioritize unsupervised temporal anomaly detection built on self-supervised features such as DINOv3. Use content-adaptive thresholding to improve robustness and reduce false alarms across diverse video streams, which matters most in applications where precision is critical.

Key insights

Temporally localized video manipulations require specialized detection methods and datasets beyond existing deepfake benchmarks.

Principles

Prioritize precision in real-world deepfake detection.
Fixed thresholds struggle with diverse video characteristics.
Self-supervised features can capture temporal discontinuities.

Method

Extract DINOv3 features for each frame, compute the cosine distance between consecutive frame pairs, then apply a rolling-window Z-score and dual thresholds (τ, δ) to identify manipulation boundaries.
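
The pipeline above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-frame DINOv3 features are assumed to be precomputed and passed in as a (T, D) array, and the summary does not define the two thresholds, so the sketch assumes τ bounds the rolling Z-score and δ is a floor on the raw cosine distance. The function names, the trailing-window design, and all default values are hypothetical.

```python
import numpy as np


def consecutive_cosine_distance(features):
    """Cosine distance between each frame embedding and the next one.

    features: (T, D) array of per-frame embeddings (assumed precomputed).
    Returns a (T-1,) array where entry j compares frame j with frame j+1.
    """
    feats = np.asarray(features, dtype=np.float64)
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    unit = feats / np.clip(norms, 1e-12, None)  # avoid division by zero
    sims = np.sum(unit[:-1] * unit[1:], axis=1)
    return 1.0 - sims


def rolling_zscore(x, window, min_history=5):
    """Z-score of x[i] against the `window` values that precede it.

    Positions with fewer than `min_history` preceding values keep a score
    of 0, since a mean and deviation from so few samples are unreliable.
    """
    z = np.zeros_like(x)
    for i in range(min_history, len(x)):
        context = x[max(0, i - window):i]
        z[i] = (x[i] - context.mean()) / (context.std() + 1e-8)
    return z


def detect_boundaries(features, window=30, tau=4.0, delta=0.1):
    """Return indices of frames that start a suspected new segment.

    ASSUMPTION: tau is a threshold on the rolling Z-score and delta is a
    minimum raw cosine distance. The source names (tau, delta) but does
    not say how they are applied; both defaults here are illustrative.
    """
    dist = consecutive_cosine_distance(features)
    z = rolling_zscore(dist, window)
    flagged = (z > tau) & (dist > delta)
    # dist[j] compares frames j and j+1, so the new segment starts at j+1.
    return np.flatnonzero(flagged) + 1
```

Calling `detect_boundaries(features)` on a (T, D) feature array returns the frame indices where a discontinuity is flagged. Because the Z-score is computed against each video's own recent history, the first test adapts to content, which is one plausible reading of the content-adaptive thresholding the summary recommends. The absolute floor on distance in this sketch keeps near-static footage, where the rolling deviation is tiny, from producing spurious flags.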

In practice

Use DINOv3 features for unsupervised anomaly detection.
Implement content-adaptive thresholding for robustness.
Develop datasets with short, embedded manipulations.

Topics

Video Manipulation Detection
Deepfake Forensics
DINOv3 Features
Temporal Anomaly Detection
Dataset Curation
Multimedia Authenticity

Code references

OkanUmur/temporally-localized-video-manipulation-detection

Best for: Research Scientist, AI Scientist, Computer Vision Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.