EMO-BOOST: Emotion-Augmented Audio-Visual Features for Improved Generalization in Deepfake Detection
Summary
The Emo-Boost framework enhances deepfake detection by integrating emotion-augmented audio-visual features, specifically targeting generalization to unseen manipulation types. This multimodal system fuses an off-the-shelf RGB- and acoustic-focused deepfake detector with EmoForensics, a novel emotion-based detector. EmoForensics models intra- and inter-modal temporal consistency in emotion representations using frozen visual (POSTER) and audio (emotion2vec) emotion encoders. The combined Emo-Boost system improves the average cross-manipulation generalization AUC by 2.1% on the FakeAVCeleb dataset, outperforming the previous best method, SIMBA, and remains competitive on DeepSpeak v2. These results indicate that emotion-based signals provide cues complementary to those of low-level feature detectors, improving robustness against novel deepfake generation techniques.
Key takeaway
AI security engineers building robust deepfake detection systems should consider integrating high-level semantic cues such as emotion consistency. Emo-Boost shows that fusing emotion-based signals with low-level feature detectors improves generalization to unseen deepfake manipulations, with a 2.1% gain in average cross-manipulation AUC on FakeAVCeleb. This approach offers a path to more resilient detection, especially as generative AI techniques evolve rapidly.
Key insights
Emotion-augmented audio-visual features improve deepfake detection generalization to unseen manipulations.
Principles
- High-level semantic cues complement low-level deepfake detection features.
- Temporal consistency in emotion signals indicates authenticity.
- Complementary signals improve the generalization of deepfake detectors.
Method
Emo-Boost fuses an off-the-shelf deepfake detector with EmoForensics, which extracts emotion representations with frozen encoders and models their intra- and inter-modal temporal consistency using transformers and a contrastive loss.
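To make the two kinds of consistency concrete, here is a minimal sketch in plain Python. It is not the EmoForensics model: the summary describes transformers trained with a contrastive loss, while this stand-in only measures cosine similarity. It assumes the per-frame emotion embeddings have already been extracted by frozen encoders, time-aligned, and projected to a shared dimension. All function names and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def intra_modal_consistency(seq: List[Vector]) -> float:
    """Mean cosine similarity between consecutive time steps of ONE modality."""
    if len(seq) < 2:
        return 1.0  # a single step has nothing to disagree with
    sims = [cosine(seq[t], seq[t + 1]) for t in range(len(seq) - 1)]
    return sum(sims) / len(sims)


def inter_modal_consistency(visual: List[Vector], audio: List[Vector]) -> float:
    """Mean cosine similarity between time-aligned visual and audio embeddings."""
    if len(visual) != len(audio) or not visual:
        raise ValueError("sequences must be non-empty and time-aligned")
    sims = [cosine(v, a) for v, a in zip(visual, audio)]
    return sum(sims) / len(sims)


# Hypothetical toy embeddings (assumption: already in a shared 2-D space).
visual = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
audio_match = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
audio_mismatch = [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]

agree = inter_modal_consistency(visual, audio_match)
disagree = inter_modal_consistency(visual, audio_mismatch)
```

In this toy case the matching audio track scores higher than the mismatched one, which is the kind of cross-modal emotional disagreement an emotion-based detector could exploit. A learned model would replace these fixed similarity measures with trained representations.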
In practice
- Integrate emotion-based features with existing deepfake detectors.
- Keep the emotion encoders frozen so the emotion features are less exposed to distribution shift across manipulation types.
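One simple way to try the first recommendation is late (score-level) fusion, sketched below. The summary does not say how Emo-Boost combines its two detectors, so the weighted average here is an assumption rather than the authors' fusion rule. The function name, the `weight` parameter, and the example scores are hypothetical.

```python
def fuse_scores(base_score: float, emotion_score: float, weight: float = 0.5) -> float:
    """Weighted average of two fake-probability scores.

    base_score:    fake probability from an existing low-level detector (assumed in [0, 1])
    emotion_score: fake probability from an emotion-based detector (assumed in [0, 1])
    weight:        share of the final score given to the emotion-based detector
    """
    for name, value in (
        ("base_score", base_score),
        ("emotion_score", emotion_score),
        ("weight", weight),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return (1.0 - weight) * base_score + weight * emotion_score


# Hypothetical scores: the low-level detector is unsure, the emotion detector is suspicious.
fused = fuse_scores(base_score=0.2, emotion_score=0.8, weight=0.5)
```

The weight would normally be tuned on held-out data that includes manipulation types absent from training, since that is the setting the emotion cues are meant to help with.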
Topics
- Deepfake Detection
- Multimodal AI
- Emotion Recognition
- Cross-Manipulation Generalization
- Audio-Visual Forensics
- Emo-Boost
Best for: Research Scientist, AI Scientist, AI Security Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.