MIBURI: Towards Expressive Interactive Gesture Synthesis

2026-03-03 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Human-Computer Interaction · Depth: Expert, quick

Summary

MIBURI is presented as the first online, causal framework for generating expressive full-body gestures and facial expressions synchronized with real-time spoken dialogue. It targets two limitations of existing work: current Embodied Conversational Agents (ECAs) often produce rigid or low-diversity motion, and generative co-speech gesture synthesis methods depend on future speech context and have long run times, which makes them hard to use in live interaction. MIBURI uses body-part aware gesture codecs to encode hierarchical motion details into multi-level discrete tokens. A two-dimensional causal framework, conditioned on LLM-based speech-text embeddings, then generates these tokens autoregressively, modeling both temporal dynamics and the part-level motion hierarchy in real time. Auxiliary objectives encourage expressive, diverse gestures and discourage static poses. Comparative evaluations indicate that the resulting gestures are natural and contextually aligned.

Key takeaway

For research scientists developing Embodied Conversational Agents, MIBURI offers an approach to two persistent problems: limited gesture expressiveness and the lack of real-time synchronization with speech. You should explore combining causal, body-part aware gesture codecs with LLM-based speech-text embeddings to achieve more natural and diverse human-like interactions, moving beyond rigid motion and beyond generation that depends on future speech context.

Key insights

MIBURI is a causal, real-time framework for generating expressive, synchronized full-body gestures and facial expressions for conversational agents.

Principles

Causal generation enables real-time interaction.
Hierarchical motion encoding improves expressiveness.
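
To make "multi-level discrete tokens" concrete, here is a minimal sketch of residual quantization, one common way to build coarse-to-fine token hierarchies. This is an assumption for illustration only: the summary does not say which quantization scheme MIBURI's codecs use, and the codebooks, function names, and 2-D "pose" vector below are all hypothetical. A body-part aware variant would presumably run one such quantizer per body part.

```python
from typing import List, Sequence

Codebook = List[List[float]]

# Hypothetical toy codebooks: level 0 is coarse, level 1 refines the leftover error.
CODEBOOKS: List[Codebook] = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, 0.0], [0.0, -0.25]],
]


def nearest(codebook: Codebook, x: Sequence[float]) -> int:
    """Return the index of the codebook entry closest to x (squared Euclidean)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], x)),
    )


def residual_quantize(x: Sequence[float], codebooks: List[Codebook]) -> List[int]:
    """Encode x as one token per level; each level quantizes the remaining residual."""
    residual = list(x)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        # Subtract what this level explained so the next level sees only the error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def decode(tokens: Sequence[int], codebooks: List[Codebook]) -> List[float]:
    """Reconstruct the vector by summing the selected entry from every level."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


tokens = residual_quantize([1.2, 0.9], CODEBOOKS)
print(tokens, decode(tokens, CODEBOOKS))
```

In this toy setup the first token captures the gross pose and the second adds detail, which is the sense in which later levels carry finer motion information.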

Method

MIBURI employs body-part aware gesture codecs to create multi-level discrete tokens, which are then autoregressively generated by a 2D causal framework conditioned on LLM-based speech-text embeddings.
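
The two-dimensional causal structure described above can be sketched as a nested decoding loop: an outer loop over time that only ever sees past and present speech, and an inner loop over token levels within the current frame. This is a hedged illustration, not the authors' implementation. The `NextTokenModel` signature, `generate_stream`, and `toy_model` are hypothetical names, and the toy model stands in for a trained network.

```python
import random
from typing import Callable, Iterable, Iterator, List, Sequence

# Hypothetical interface: given the current frame's speech-text embedding, all
# completed past frames, and the tokens already chosen for the current frame,
# return one probability per vocabulary entry for the next level's token.
NextTokenModel = Callable[[Sequence[float], List[List[int]], List[int]], List[float]]


def generate_stream(
    model: NextTokenModel,
    speech_embeddings: Iterable[Sequence[float]],
    num_levels: int,
    rng: random.Random,
) -> Iterator[List[int]]:
    """Yield one list of level tokens per incoming speech frame, with no lookahead."""
    history: List[List[int]] = []
    for embedding in speech_embeddings:  # temporal axis: frames arrive one by one
        current: List[int] = []
        for _ in range(num_levels):  # hierarchy axis: coarse to fine within a frame
            probs = model(embedding, history, current)
            token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
            current.append(token)
        history.append(current)
        yield current  # the frame can be decoded and rendered immediately


def toy_model(embedding: Sequence[float], history: List[List[int]], current: List[int]) -> List[float]:
    """Stand-in for a trained network: a one-hot distribution over 4 tokens."""
    hot = (len(history) + len(current)) % 4
    return [1.0 if i == hot else 0.0 for i in range(4)]


frames = list(generate_stream(toy_model, [[0.0]] * 3, num_levels=2, rng=random.Random(0)))
print(frames)
```

Because each frame is yielded as soon as its last level is sampled, the loop never waits on future speech, which is the property that makes real-time use possible.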

In practice

Integrate LLM-based speech-text embeddings.
Utilize auxiliary objectives for gesture diversity.

Topics

Embodied Conversational Agents
Gesture Synthesis
Real-time AI
Causal AI Frameworks
LLM Embeddings

Best for: Research Scientist, AI Researcher, AI Scientist, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.