When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models
Summary
NUMINA is a training-free framework for improving numerical alignment in text-to-video diffusion models, which often fail to generate the exact number of objects a prompt specifies. It first identifies prompt-layout inconsistencies by selecting discriminative self- and cross-attention heads and deriving a countable latent layout from them. It then refines that layout conservatively and modulates cross-attention to guide regeneration of the video. Evaluated on the new CountBench dataset, NUMINA improves counting accuracy by up to 7.4% on the Wan2.1-1.3B model, and by 4.9% and 5.5% on the 5B and 14B models, respectively. It also improves CLIP alignment while preserving temporal consistency, which suggests that structural guidance is a practical complement to seed search and prompt enhancement for count-accurate text-to-video generation.
Key takeaway
For research scientists developing or deploying text-to-video diffusion models, NUMINA offers a way to improve the numerical accuracy of generated content without retraining. Consider it when object-counting failures matter, that is, when the number of instances on screen must match the numeral in the prompt. It complements prompt engineering and seed search rather than replacing them.
Key insights
NUMINA improves text-to-video diffusion models' object counting accuracy via a training-free, attention-guided framework.
Principles
- Structural guidance complements prompt enhancement.
- Attention heads reveal prompt-layout inconsistencies.
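The second principle can be illustrated with a toy example. The sketch below is not the paper's procedure: it assumes a single 2D attention map for the object token has already been extracted, and it treats each 4-connected region above a threshold as one instance. The names `count_instances`, `count_mismatch`, the threshold value, and the toy map are all hypothetical.

```python
from collections import deque


def count_instances(attn_map, threshold=0.5):
    """Count 4-connected regions whose attention is at least `threshold`.

    Assumption: one connected region corresponds to one object instance.
    """
    rows, cols = len(attn_map), len(attn_map[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or attn_map[r][c] < threshold:
                continue
            # New region found: flood-fill it so it is counted once.
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and attn_map[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count


def count_mismatch(attn_map, target_count, threshold=0.5):
    """Detected count minus requested count; 0 means the layout agrees."""
    return count_instances(attn_map, threshold) - target_count


# Toy attention map containing two separate high-attention blobs.
toy_map = [
    [0.9, 0.8, 0.0, 0.0, 0.0],
    [0.7, 0.9, 0.0, 0.0, 0.6],
    [0.0, 0.0, 0.0, 0.0, 0.7],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
```

For a prompt asking for three objects, `count_mismatch(toy_map, 3)` is negative, which would flag the layout as containing too few instances.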
Method
NUMINA identifies prompt-layout inconsistencies using discriminative self- and cross-attention heads to derive a countable latent layout, then refines this layout and modulates cross-attention for guided regeneration.
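As a rough illustration of what modulating cross-attention with a layout can mean, the sketch below adds a bias to the attention logit of one text token at spatial positions inside a layout mask and subtracts it outside, then renormalizes over tokens. This is an assumed, simplified form, not NUMINA's actual update rule; `modulate_cross_attention`, `strength`, and the flat list representation are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def modulate_cross_attention(logits, layout_mask, token_idx, strength=2.0):
    """Bias one token's attention toward the layout (assumed scheme).

    logits:      one row per spatial position, one column per text token
    layout_mask: truthy where the refined layout places an instance
    token_idx:   column of the object token to steer
    strength:    hypothetical bias added inside / subtracted outside the mask
    """
    modulated = []
    for row, inside in zip(logits, layout_mask):
        biased = list(row)  # copy so the input logits stay untouched
        biased[token_idx] += strength if inside else -strength
        modulated.append(softmax(biased))
    return modulated


# Two spatial positions, two tokens, uniform logits before modulation.
example = modulate_cross_attention(
    logits=[[0.0, 0.0], [0.0, 0.0]],
    layout_mask=[1, 0],
    token_idx=1,
)
```

In `example`, the steered token receives more than half of the attention at the position inside the mask and less than half at the position outside it, while each row still sums to one.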
In practice
- Use NUMINA when a prompt specifies an exact object count and the base model tends to miss it.
- Apply structural guidance alongside seed search and prompt enhancement in text-to-video generation.
Topics
- Text-to-Video Diffusion Models
- Numerical Alignment
- NUMINA Framework
- Cross-Attention Modulation
- CountBench Dataset
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.