CULTURESCORE: Evaluating Cultural Faithfulness in Video Generation Models

2026-03-31 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, AI Ethics & Fairness · Depth: Expert, extended

Summary

CultureScore is a compositional evaluation framework for assessing cultural faithfulness in video generation models such as Veo 3.1, LTX-2, and Wan 2.2. It decomposes cultural representation into three dimensions: Identity, Context, and Behavior. The authors operationalized the framework as an evaluation suite spanning 10 countries, generating 6,180 videos across the three state-of-the-art models. No current model achieves culturally faithful video generation: the best-performing model reaches only 56.8% overall CultureScore. Behavior is the most challenging dimension, remaining below 52% for every model. Human preference rankings correlated inversely with traditional visual quality metrics such as VideoScore, which suggests that cultural faithfulness needs to be evaluated as a criterion in its own right.

Key takeaway

AI scientists and machine learning engineers developing video generation models should treat cultural faithfulness as a goal separate from visual quality. Metrics like VideoScore can actively mislead, because models that excel in perceptual quality often fail culturally. Integrate a decomposed evaluation framework such as CultureScore early in the development cycle, and focus on improving the Behavior dimension, where all models score lowest. Also check that models have internalized cultural concepts rather than relying solely on explicit geographic identifiers, since that reliance produces systematic biases.

Key insights

Video generation models lack cultural faithfulness, especially in depicting behaviors, despite high visual quality.

Principles

Cultural faithfulness requires decomposed evaluation.
Perceptual quality metrics can mislead cultural assessment.
Models rely heavily on explicit geographic cues.

Method

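CultureScore decomposes each prompt into Identity, Context, and Behavior components and generates a video for it. A vision-language model (VLM) then answers questions about the video to quantify faithfulness in each dimension, and the per-dimension scores are aggregated into an overall CultureScore.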
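
The aggregation step can be illustrated with a minimal sketch. This summary does not state the paper's exact scoring formula, so the code below makes two assumptions: each dimension score is the fraction of VLM questions answered in the culturally faithful way, and the overall score is the unweighted mean of the three dimension scores. The names `QAResult`, `dimension_scores`, and `overall_score` are hypothetical and do not come from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

# The three dimensions named in the framework.
DIMENSIONS = ("identity", "context", "behavior")


@dataclass(frozen=True)
class QAResult:
    """One VLM answer about a generated video (hypothetical record type)."""
    dimension: str   # one of DIMENSIONS
    question: str    # question posed to the VLM
    passed: bool     # True if the answer indicates cultural faithfulness


def dimension_scores(results):
    """Return the pass rate per dimension (assumed scoring rule)."""
    buckets = defaultdict(list)
    for r in results:
        if r.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {r.dimension}")
        buckets[r.dimension].append(1.0 if r.passed else 0.0)
    # Only report dimensions that actually received questions.
    return {d: mean(buckets[d]) for d in DIMENSIONS if d in buckets}


def overall_score(results):
    """Unweighted mean of dimension scores (assumption, not the paper's formula)."""
    scores = dimension_scores(results)
    if not scores:
        raise ValueError("no QA results to aggregate")
    return mean(list(scores.values()))
```

Averaging per dimension before averaging overall keeps a dimension with many questions from dominating the final number. Whether the paper weights dimensions equally is not stated in this summary.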

In practice

Augment prompts with explicit cultural details.
Test models for implicit cultural knowledge.
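
Both practices can be set up as simple prompt transformations. The sketch below is illustrative only: `augment_prompt`, `geo_ablation_pair`, and `cue_gap` are hypothetical helpers, and the prompt templates are assumptions rather than the templates used in the paper. The idea is to score the same cultural concept with and without a geographic identifier and compare the results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptPair:
    """Two versions of one prompt, differing only in the geographic cue."""
    with_geo: str
    without_geo: str


def augment_prompt(base, details):
    """Append explicit cultural details to a base prompt."""
    if not details:
        return base
    return f"{base}, showing {', '.join(details)}"


def geo_ablation_pair(concept_prompt, country):
    """Build a prompt pair to probe implicit cultural knowledge.

    The version without the country name tests whether the model
    recognizes the concept itself rather than the geographic cue.
    """
    return PromptPair(
        with_geo=f"{concept_prompt} in {country}",
        without_geo=concept_prompt,
    )


def cue_gap(score_with_geo, score_without_geo):
    """Positive values mean the model leans on the explicit geographic cue."""
    return score_with_geo - score_without_geo
```

A large positive `cue_gap` across many concepts would be consistent with the finding that models rely on explicit geographic identifiers. The scores passed to it would come from whichever faithfulness metric is in use.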

Topics

Video Generation Models
Cultural Faithfulness
AI Bias
Evaluation Metrics
Vision-Language Models
Prompt Engineering

Best for: Research Scientist, Computer Vision Engineer, AI Product Manager, AI Scientist, Machine Learning Engineer, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.