CulturALL: Benchmarking Multilingual and Multicultural Competence of LLMs on Grounded Tasks

2024-11-20 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, extended

Summary

Alibaba Group and collaborating institutions introduce CulturALL, a new benchmark designed to assess the multilingual and multicultural competence of Large Language Models (LLMs) on "grounded tasks." Unlike existing benchmarks that focus on generic language understanding or cultural trivia, CulturALL evaluates LLMs' ability to reason within real-world, context-rich scenarios. The benchmark comprises 2,610 samples across 14 languages and 51 regions, distributed among 16 diverse topics. It was constructed using a human-AI collaborative framework, where expert annotators ensure factual accuracy and difficulty, while LLMs assist in generating and enriching scenarios. Initial experiments with 15 LLM configurations show that the best-performing model, gemini-2.5-pro_auto_true, achieved only 44.48% accuracy, indicating substantial room for improvement in LLMs' culturally grounded reasoning, especially for open-source models.

Key takeaway

For AI Engineers developing LLMs for global deployment, CulturALL shows that current models struggle significantly with real-world, culturally grounded tasks. Prioritize stronger multi-step reasoning and effective web search integration, since both have a marked effect on performance. Also evaluate with native-language inputs rather than relying on English translations alone, because translation can dilute cultural context and reduce accuracy.

Key insights

CulturALL benchmarks LLMs' multilingual, multicultural, and grounded reasoning on real-world tasks, revealing significant performance gaps.

Principles

Grounded tasks require fusing language, cultural knowledge, and contextual reasoning.
Web search significantly improves LLM performance on culturally grounded tasks.
Native-language prompts often outperform their English translations on culturally grounded tasks.
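
One way to check the native-versus-translated principle on your own evaluation runs is to compute accuracy per language and per prompt condition. The following is a minimal sketch, not the benchmark's evaluation code: the `Result` record, its field names, and the "native" / "translated" labels are all assumptions made for illustration, and the toy rows are placeholders rather than CulturALL results.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One graded model answer (hypothetical schema, not CulturALL's)."""
    language: str   # language code of the sample, e.g. "ja"
    condition: str  # "native" or "translated" (assumed labels)
    correct: bool   # whether the answer matched the reference


def accuracy_by_condition(results):
    """Return {language: {condition: accuracy}} for a list of Result rows."""
    # (language, condition) -> [number correct, number seen]
    counts = defaultdict(lambda: [0, 0])
    for r in results:
        cell = counts[(r.language, r.condition)]
        cell[0] += int(r.correct)
        cell[1] += 1

    table = defaultdict(dict)
    for (language, condition), (n_correct, n_total) in counts.items():
        table[language][condition] = n_correct / n_total
    return {language: dict(conds) for language, conds in table.items()}


# Toy rows for illustration only; these are not CulturALL results.
toy = [
    Result("ja", "native", True),
    Result("ja", "native", True),
    Result("ja", "translated", True),
    Result("ja", "translated", False),
]
table = accuracy_by_condition(toy)
```

Keeping the two conditions side by side for each language makes it easy to see where translation costs accuracy, instead of hiding the effect in a single aggregate score.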

Method

CulturALL uses a human-LLM collaborative framework for benchmark creation, involving cultural topic sourcing, sample creation, difficulty enrichment via long-tail swaps and compositional examples, and quality control.
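
The four stages named above can be pictured as an ordered pipeline that each sample passes through. The sketch below is only an illustration under stated assumptions: the `Sample` fields, the stage identifiers, and the strictly sequential order are hypothetical and are not taken from the benchmark's actual tooling.

```python
from dataclasses import dataclass, field

# Stage names paraphrase the summary; the strict ordering is an assumption.
STAGES = (
    "topic_sourcing",
    "sample_creation",
    "difficulty_enrichment",
    "quality_control",
)


@dataclass
class Sample:
    """A benchmark item (hypothetical schema, not CulturALL's)."""
    language: str
    region: str
    topic: str
    question: str
    answer: str
    completed_stages: list = field(default_factory=list)


def advance(sample, stage):
    """Record a finished stage, rejecting anything out of the assumed order."""
    done = len(sample.completed_stages)
    if done >= len(STAGES):
        raise ValueError("sample has already passed every stage")
    expected = STAGES[done]
    if stage != expected:
        raise ValueError(f"expected stage {expected!r}, got {stage!r}")
    sample.completed_stages.append(stage)
    return sample


def is_release_ready(sample):
    """A sample is releasable only after all stages ran in order."""
    return tuple(sample.completed_stages) == STAGES
```

Tracking completed stages on each record makes it straightforward to confirm that no sample reaches the released set without passing quality control.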

In practice

Prioritize LLMs with robust reasoning and web search for multicultural applications.
Improve how LLMs use web search results when solving grounded tasks.
Focus on improving multi-step reasoning for complex cultural scenarios.

Topics

LLM Benchmarking
Multilingual Competence
Multicultural Competence
Grounded Tasks
Human-AI Data Generation

Code references

AIDC-AI/Marco-LLM

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.