CulturALL: Benchmarking Multilingual and Multicultural Competence of LLMs on Grounded Tasks
Summary
Alibaba Group and collaborating institutions introduce CulturALL, a benchmark designed to assess the multilingual and multicultural competence of Large Language Models (LLMs) on "grounded tasks." Unlike existing benchmarks that focus on generic language understanding or cultural trivia, CulturALL evaluates whether LLMs can reason within real-world, context-rich scenarios. The benchmark comprises 2,610 samples covering 14 languages, 51 regions, and 16 topics. It was constructed with a human-AI collaborative framework in which expert annotators ensure factual accuracy and difficulty while LLMs assist in generating and enriching scenarios. Initial experiments with 15 LLM configurations show that even the best-performing one, gemini-2.5-pro_auto_true, achieved only 44.48% accuracy, indicating substantial room for improvement in culturally grounded reasoning, especially for open-source models.
Key takeaway
For AI engineers developing LLMs for global deployment, CulturALL shows that current models struggle significantly with real-world, culturally grounded tasks. Prioritize stronger multi-step reasoning and effective integration of web search tools, since both factors critically affect performance. Also evaluate with native-language inputs, because direct translation can dilute cultural context and reduce accuracy.
Key insights
CulturALL benchmarks LLMs' multilingual, multicultural, and grounded reasoning on real-world tasks, revealing significant performance gaps.
Principles
- Grounded tasks require fusing language, cultural knowledge, and contextual reasoning.
- Web search significantly improves LLM performance on culturally grounded tasks.
- Native-language prompts often outperform their English translations on culturally grounded tasks.
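One way to check the last two principles on your own evaluation runs is to group per-sample correctness by condition (for example native vs. English prompts, or search on vs. off) and compare accuracies. The sketch below is a minimal illustration only: the function name, the condition labels, and the toy records are assumptions made for this example, not CulturALL's actual data format or results.

```python
from collections import defaultdict


def accuracy_by_condition(records):
    """Return {condition: accuracy} from (condition, is_correct) pairs.

    `records` is a hypothetical format chosen for this sketch; adapt it
    to whatever your evaluation harness actually emits.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for condition, is_correct in records:
        totals[condition] += 1
        correct[condition] += int(is_correct)
    return {c: correct[c] / totals[c] for c in totals}


# Made-up toy results for illustration, NOT figures from the paper.
toy_records = [
    ("native", True), ("native", True), ("native", False), ("native", True),
    ("english", True), ("english", False), ("english", False), ("english", True),
]

scores = accuracy_by_condition(toy_records)
gap = scores["native"] - scores["english"]
print(scores)
print(f"native minus english: {gap:.2f}")
```

The same helper works for any grouping key, such as language, region, or topic, which makes it easy to see where a model is weakest before deciding what to improve.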
Method
CulturALL uses a human-LLM collaborative framework for benchmark creation, involving cultural topic sourcing, sample creation, difficulty enrichment via long-tail swaps and compositional examples, and quality control.
In practice
- Prioritize LLMs with robust reasoning and web search for multicultural applications.
- Develop LLMs to better leverage web search results for grounded tasks.
- Focus on improving multi-step reasoning for complex cultural scenarios.
Topics
- LLM Benchmarking
- Multilingual Competence
- Multicultural Competence
- Grounded Tasks
- Human-AI Data Generation
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.