EMO-BOOST: Emotion-Augmented Audio-Visual Features for Improved Generalization in Deepfake Detection

2025-09-26 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, extended

Summary

The Emo-Boost framework improves deepfake detection by adding emotion-augmented audio-visual features, with the specific goal of generalizing to unseen manipulation types. This multimodal system fuses an off-the-shelf RGB- and acoustic-focused deepfake detector with EmoForensics, a novel emotion-based detector. EmoForensics models intra- and inter-modal temporal consistency in emotion representations using frozen visual (POSTER) and audio (emotion2vec) emotion encoders. The combined Emo-Boost system improves the average cross-manipulation generalization AUC by 2.1% on the FakeAVCeleb dataset, outperforming the previous best method, SIMBA. On DeepSpeak v2, Emo-Boost remains competitive. Together, these results indicate that emotion-based signals provide cues complementary to those of low-level feature detectors, improving robustness to novel deepfake generation techniques.
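To make the headline metric concrete, here is a minimal sketch of how an average cross-manipulation AUC could be computed. It assumes a leave-one-manipulation-out protocol in which each manipulation type is scored by a model that never saw it during training; the summary does not spell out the exact protocol. The function names and the manipulation labels `type_a` and `type_b` are hypothetical.

```python
import numpy as np

def roc_auc(labels, scores):
    """AUC as the probability that a fake (label 1) outscores a real (label 0).

    Ties count as half a win (Mann-Whitney U formulation).
    """
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]  # every fake/real score pair
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))

def mean_cross_manipulation_auc(results):
    """results maps a held-out manipulation name to (labels, scores).

    Assumption: each entry was scored by a model trained without that
    manipulation type. Returns per-type AUCs and their unweighted mean.
    """
    per_type = {name: roc_auc(y, s) for name, (y, s) in results.items()}
    return per_type, float(np.mean(list(per_type.values())))

# Hypothetical toy scores, for illustration only.
results = {
    "type_a": ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    "type_b": ([0, 0, 1, 1], [0.1, 0.2, 0.7, 0.9]),
}
per_type, mean_auc = mean_cross_manipulation_auc(results)
```

Averaging without weights treats every manipulation type equally, regardless of how many clips it contributes.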

Key takeaway

AI security engineers building robust deepfake detection systems should consider integrating high-level semantic cues such as emotion consistency. Emo-Boost shows that fusing emotion-based signals with low-level feature detectors improves generalization to novel deepfake manipulations, with a 2.1% gain in average cross-manipulation AUC on FakeAVCeleb. This approach offers a path to more resilient detection, especially as generative AI techniques evolve rapidly.

Key insights

Emotion-augmented audio-visual features improve deepfake detection generalization to unseen manipulations.

Principles

High-level semantic cues support low-level deepfake detection.
Temporal consistency in emotion signals indicates authenticity.
Complementary signals enhance deepfake generalization.

Method

Emo-Boost fuses an off-the-shelf deepfake detector with EmoForensics, which extracts emotion representations via frozen encoders and models intra- and inter-modal temporal consistency using transformers and a contrastive loss.
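The inter-modal consistency idea can be illustrated with a small NumPy sketch. The summary does not give the form of the contrastive loss, so this assumes a symmetric InfoNCE-style objective over the time steps of a single real clip: the visual and audio emotion embeddings at the same time step are treated as a positive pair, and all other time steps as negatives. The function names, the temperature value, and the random stand-in embeddings are assumptions; in the described system the embeddings would come from the frozen POSTER and emotion2vec encoders.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def inter_modal_contrastive_loss(visual_emb, audio_emb, temperature=0.1):
    """Symmetric InfoNCE over time steps (an assumed loss form).

    visual_emb, audio_emb: arrays of shape (T, D), one emotion embedding
    per time step and modality. Lower loss means the two modalities agree
    on the emotion trajectory over time.
    """
    v = l2_normalize(np.asarray(visual_emb, dtype=float))
    a = l2_normalize(np.asarray(audio_emb, dtype=float))
    logits = v @ a.T / temperature  # (T, T) similarity matrix
    idx = np.arange(len(v))
    loss_v2a = -log_softmax(logits, axis=1)[idx, idx].mean()  # visual -> audio
    loss_a2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # audio -> visual
    return float(0.5 * (loss_v2a + loss_a2v))

# Random stand-ins for per-time-step emotion embeddings (T=8, D=16).
rng = np.random.default_rng(0)
visual = rng.normal(size=(8, 16))
aligned_loss = inter_modal_contrastive_loss(visual, visual)
misaligned_loss = inter_modal_contrastive_loss(visual, visual[::-1])
```

In this toy setup, temporally aligned modalities yield a lower loss than time-reversed ones, which is the kind of signal a consistency-based detector could exploit.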

In practice

Integrate emotion-based features with existing deepfake detectors.
Keep the emotion encoders frozen so their representations are not fitted to the manipulation types seen during training.
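One simple way to try the first recommendation is score-level fusion of an existing detector with an emotion-based one. The summary does not describe how Emo-Boost combines its two detectors, so the sketch below assumes a weighted average in logit space; `fuse_scores` and its `weight` parameter are hypothetical and would need tuning on validation data.

```python
import numpy as np

def logit(p, eps=1e-6):
    """Map probabilities to log-odds, clipping to avoid infinities."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    return np.log(p / (1 - p))

def sigmoid(z):
    """Map log-odds back to probabilities."""
    return 1 / (1 + np.exp(-z))

def fuse_scores(base_prob, emotion_prob, weight=0.5):
    """Late fusion of two fake-probability scores (an assumed scheme).

    base_prob:    fake probability from the existing low-level detector.
    emotion_prob: fake probability from the emotion-consistency detector.
    weight:       share given to the emotion detector, between 0 and 1.
    """
    z = (1 - weight) * logit(base_prob) + weight * logit(emotion_prob)
    return sigmoid(z)

# Hypothetical scores for three clips.
base = np.array([0.9, 0.4, 0.2])
emotion = np.array([0.7, 0.8, 0.1])
fused = fuse_scores(base, emotion, weight=0.3)
```

Averaging in logit space rather than probability space lets a confident detector move the fused score more than an uncertain one, which is one reasonable design choice among several.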

Topics

Deepfake Detection
Multimodal AI
Emotion Recognition
Cross-Manipulation Generalization
Audio-Visual Forensics
Emo-Boost

Code references

JoeLeelyf/OpenAVFF

Best for: Research Scientist, AI Scientist, AI Security Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.