When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models

2026-04-09 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

NUMINA is a training-free framework for improving numerical alignment in text-to-video diffusion models, which often fail to generate the number of objects a prompt specifies. It first detects prompt-layout inconsistencies by selecting discriminative self- and cross-attention heads and deriving a countable latent layout from them. It then refines this layout conservatively and modulates cross-attention to guide regeneration of the video. On the newly introduced CountBench dataset, NUMINA improves counting accuracy by up to 7.4% on Wan2.1-1.3B, and by 4.9% and 5.5% on the 5B and 14B models, respectively. It also improves CLIP alignment while preserving temporal consistency, indicating that structural guidance is a practical complement to seed search and prompt enhancement for count-accurate text-to-video generation.

Key takeaway

For research scientists developing or deploying text-to-video diffusion models, NUMINA offers a training-free way to improve numerical accuracy in generated content. Consider it when generated videos show the wrong number of objects and the count stated in the prompt must be rendered exactly. It complements prompt engineering and seed search rather than replacing them.

Key insights

NUMINA improves text-to-video diffusion models' object counting accuracy via a training-free, attention-guided framework.

Principles

Structural guidance complements prompt enhancement.
Attention heads reveal prompt-layout inconsistencies.
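
To make the second principle concrete, the toy sketch below counts distinct high-attention regions in a single 2D attention map and compares that count with the numeral in the prompt. It is an illustration only, not NUMINA's implementation: the function names, the fixed threshold, the 4-connectivity rule, and the toy map are all assumptions, and this summary does not specify how heads are selected or how the latent layout is actually built.

```python
from collections import deque


def count_instances(attn_map, threshold=0.5):
    """Count 4-connected regions whose attention is >= threshold.

    attn_map is a hypothetical 2D grid (list of rows) of attention
    weights for one object token; the threshold is an assumed value.
    """
    rows, cols = len(attn_map), len(attn_map[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or attn_map[r][c] < threshold:
                continue
            # New region found: flood-fill it so it is counted once.
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (
                        0 <= ny < rows
                        and 0 <= nx < cols
                        and not seen[ny][nx]
                        and attn_map[ny][nx] >= threshold
                    ):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count


def count_mismatch(attn_map, requested, threshold=0.5):
    """Positive: instances are missing. Negative: there are too many."""
    return requested - count_instances(attn_map, threshold)


# Toy map with two separate blobs, for a prompt that asks for three objects.
toy_map = [
    [0.9, 0.8, 0.0, 0.0, 0.0],
    [0.7, 0.9, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.6, 0.7],
    [0.0, 0.0, 0.0, 0.8, 0.9],
]
found = count_instances(toy_map)
missing = count_mismatch(toy_map, requested=3)
print(found, missing)
```

A nonzero mismatch is the kind of prompt-layout inconsistency that a layout-refinement step would then need to correct.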

Method

NUMINA identifies prompt-layout inconsistencies using discriminative self- and cross-attention heads to derive a countable latent layout, then refines this layout and modulates cross-attention for guided regeneration.
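
One generic way to picture the cross-attention modulation step is to add a bias to the pre-softmax attention logits of the object token at positions where the refined layout places an instance. The sketch below does this on plain Python lists. It is a minimal sketch under stated assumptions, not the authors' method: the additive-bias form, the `boost` value, and the names `modulated_cross_attention` and `region_mask` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def modulated_cross_attention(logits, region_mask, token_index, boost=2.0):
    """Bias one text token's logit inside a target region, then normalize.

    logits:      one list of per-token logits for each spatial position
    region_mask: one bool per spatial position (True = inside the region
                 where the layout wants an instance); hypothetical input
    token_index: index of the object token to strengthen
    boost:       assumed additive bias, chosen only for illustration
    """
    weights = []
    for pos_logits, in_region in zip(logits, region_mask):
        adjusted = list(pos_logits)  # copy so the input is not mutated
        if in_region:
            adjusted[token_index] += boost
        weights.append(softmax(adjusted))
    return weights


# Two spatial positions, two text tokens; only position 0 is in the region.
logits = [[0.0, 0.0], [0.0, 0.0]]
weights = modulated_cross_attention(
    logits, region_mask=[True, False], token_index=1
)
print(weights)
```

Inside the region the object token receives a larger share of attention, while positions outside it are left unchanged, which is one way to keep such guidance conservative.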

In practice

Use NUMINA when prompts specify exact object counts.
Apply structural guidance alongside seed search and prompt enhancement in text-to-video generation.

Topics

Text-to-Video Diffusion Models
Numerical Alignment
NUMINA Framework
Cross-Attention Modulation
CountBench Dataset

Code references

H-EmbodVis/NUMINA

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.