When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Computer Vision) · Depth: Expert, quick

Summary

NUMINA is a training-free framework for improving numerical alignment in text-to-video diffusion models, which often fail to generate the number of objects a prompt specifies. It first detects inconsistencies between the prompt and the generated layout by selecting discriminative self- and cross-attention heads and deriving a countable latent layout from them. It then refines this layout conservatively and modulates cross-attention to guide regeneration of the video. On the newly introduced CountBench dataset, NUMINA raises counting accuracy by up to 7.4% on Wan2.1-1.3B, and by 4.9% and 5.5% on the 5B and 14B models, respectively. It also improves CLIP alignment while preserving temporal consistency, which suggests that structural guidance is a practical complement to seed search and prompt enhancement for count-accurate text-to-video generation.

Key takeaway

For research scientists developing or deploying text-to-video diffusion models, NUMINA offers a training-free way to improve numerical accuracy in generated content. It is worth considering wherever object-count failures are common and the video must show exactly the number of objects stated in the prompt. It complements prompt enhancement and seed search rather than replacing them.

Key insights

NUMINA improves text-to-video diffusion models' object counting accuracy via a training-free, attention-guided framework.

Principles

Method

NUMINA identifies prompt-layout inconsistencies using discriminative self- and cross-attention heads to derive a countable latent layout, then refines this layout and modulates cross-attention for guided regeneration.
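The count-check and guidance steps described above can be illustrated with a toy sketch. This is not the authors' implementation: the function names `count_instances` and `modulate_logits`, the 0.5 threshold, the connected-component counting rule, and the additive bias are all assumptions chosen for illustration, since the summary does not specify how the layout is derived or how cross-attention is modulated.

```python
from collections import deque

import numpy as np


def count_instances(attn_map, threshold=0.5):
    """Count 4-connected blobs in a min-max normalised attention map.

    Hypothetical stand-in for deriving a "countable layout" from attention.
    """
    lo, hi = attn_map.min(), attn_map.max()
    norm = (attn_map - lo) / (hi - lo + 1e-8)
    mask = norm > threshold
    labels = np.zeros(mask.shape, dtype=int)
    count = 0
    for seed in zip(*np.nonzero(mask)):
        if labels[seed]:
            continue  # pixel already belongs to a labelled blob
        count += 1
        labels[seed] = count
        queue = deque([seed])
        while queue:  # breadth-first flood fill over the thresholded mask
            r, c = queue.popleft()
            for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                inside = 0 <= nr < mask.shape[0] and 0 <= nc < mask.shape[1]
                if inside and mask[nr, nc] and not labels[nr, nc]:
                    labels[nr, nc] = count
                    queue.append((nr, nc))
    return count, labels


def modulate_logits(logits, layout_mask, token_idx, strength=2.0):
    """Add a bias to one token's cross-attention logits inside the layout.

    logits: (H*W, num_tokens) array; layout_mask: boolean (H, W) array.
    The additive form is an assumption, not the paper's formula.
    """
    out = logits.copy()
    out[layout_mask.ravel(), token_idx] += strength
    return out


# Toy attention map for an object token: two separate high-response regions.
attn = np.zeros((6, 6))
attn[0:2, 0:2] = 1.0
attn[4:6, 3:6] = 0.9

requested = 3  # numeral parsed from the prompt (assumed given)
count, labels = count_instances(attn)
needs_guidance = count != requested  # prompt-layout inconsistency detected

# Bias the object token (index 1 of 4 hypothetical tokens) inside the layout.
logits = np.zeros((attn.size, 4))
biased = modulate_logits(logits, labels > 0, token_idx=1)
```

In a real pipeline the layout would first be refined to contain the requested number of regions before any biasing; the sketch skips that step and only shows where a count mismatch would be detected and where attention would be steered. A library routine such as `scipy.ndimage.label` could replace the hand-written flood fill.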

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.