ReFree: Towards Realistic Co-Speech Video Generation via Reward-Free RL and Multilevel Speech Guidance

2026-06-11 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision & Pattern Recognition · Depth: Expert, quick

Summary

ReFree-S2V is a novel flow-matching speech-to-portrait animation framework designed to generate realistic co-speech videos, addressing the challenge of balancing accurate lip synchronization with expressive facial and head movements. Existing methods often compromise between precise phoneme-to-lip alignment and dynamic expressions, resulting in either rigid or poorly synchronized animations. ReFree-S2V builds on a pretrained video generation model, incorporating a multi-level speech representation that captures both phonetic and prosodic information at local and global granularities. These representations are selectively integrated into transformer blocks using learnable level selectors to ensure both accurate lip synchronization and natural expressive motion. Furthermore, the framework introduces a reward-free reinforcement learning scheme within its flow-matching training to prevent perceptually implausible head movements, eliminating the need for handcrafted metrics or costly human preference annotations. Experiments demonstrate ReFree-S2V achieves state-of-the-art performance, surpassing current methods in quantitative lip-sync accuracy and qualitative human evaluations of naturalness and expressivity.

Key takeaway

For machine learning engineers building speech-driven animation, ReFree-S2V offers a way past the trade-off between lip-sync accuracy and expressive motion. Consider adding multi-level speech representations and reward-free reinforcement learning to your flow-matching framework. The method removes the need for costly human preference annotations and handcrafted metrics, which may streamline development while improving naturalness and expressivity in generated co-speech videos.

Key insights

ReFree-S2V balances precise lip-sync and expressive motion in co-speech video generation using multi-level speech guidance and reward-free RL.

Principles

Multi-level speech representations enhance realism.
Reward-free RL can prevent implausible motion.
Selective feature injection improves control.
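
As a rough illustration of selective feature injection, the sketch below blends several speech-feature levels with softmax weights derived from per-block logits. It is a minimal sketch under stated assumptions, not the paper's implementation: the summary only says that learnable level selectors choose among multi-level representations, so the soft gating, the function names (`softmax`, `select_levels`), the three example levels, and the logit values are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_levels(level_features, selector_logits):
    """Blend per-level speech features using one block's selector logits.

    Assumption: the selector is a soft gate (softmax weights). The source
    does not specify the selector's actual form.
    """
    weights = softmax(selector_logits)
    dim = len(level_features[0])
    mixed = [0.0] * dim
    for w, feat in zip(weights, level_features):
        for i, v in enumerate(feat):
            mixed[i] += w * v
    return mixed, weights


# Hypothetical toy features, one vector per speech level.
levels = [
    [1.0, 0.0, 0.0],  # e.g. local phonetic features
    [0.0, 1.0, 0.0],  # e.g. local prosodic features
    [0.0, 0.0, 1.0],  # e.g. global features
]
# Hypothetical logits for a single transformer block; in training these
# would be learned parameters, one set per block.
block_logits = [2.0, 0.0, -1.0]
mixed, weights = select_levels(levels, block_logits)
```

With separate logits per block, different blocks could in principle weight phonetic and prosodic levels differently, which is one plausible way to keep lip-sync and expressive motion from competing for the same conditioning signal.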

Method

ReFree-S2V applies flow matching on top of a pretrained video generation model, injects multi-level speech representations into transformer blocks through learnable level selectors, and uses reward-free RL to keep head movements natural.
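
For readers unfamiliar with flow matching, the sketch below shows how a generic training pair and regression loss are formed. It is a minimal sketch, assuming a linear interpolation path between a noise sample and a data sample, which is one common choice; the summary does not state which path ReFree-S2V uses, and the reward-free RL term is left out because no formula is given. The names `flow_matching_pair` and `mse` and the toy vectors are hypothetical.

```python
import random


def flow_matching_pair(x0, x1, t):
    """Return the point x_t on a linear path and its target velocity.

    Assumption: x0 is noise (t = 0) and x1 is data (t = 1), so
    x_t = (1 - t) * x0 + t * x1 and the target velocity is x1 - x0.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target_v = [b - a for a, b in zip(x0, x1)]
    return x_t, target_v


def mse(pred, target):
    """Mean squared error between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


rng = random.Random(0)
x1 = [0.5, -1.0, 2.0]                    # stand-in for a clean video latent
x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # Gaussian noise sample
t = rng.random()                         # time sampled uniformly in [0, 1)
x_t, target_v = flow_matching_pair(x0, x1, t)

# A real model would predict velocity from (x_t, t, speech features);
# a dummy zero prediction stands in here to show the loss computation.
loss = mse([0.0, 0.0, 0.0], target_v)
```

In a speech-driven setting, the velocity network would additionally receive the speech conditioning, so the same regression target applies while the conditioning determines which motion the model learns to produce.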

In practice

Generate highly realistic talking avatars.
Improve virtual assistant expressivity.
Enhance digital character animation.

Topics

Co-Speech Video Generation
Speech-to-Portrait Animation
Flow Matching
Reward-Free Reinforcement Learning
Multi-level Speech Guidance
Lip Synchronization

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.