MARCA: A Checklist-Based Benchmark for Multilingual Web Search
Summary
MARCA is a new bilingual benchmark designed to evaluate large language models (LLMs) on web-based information seeking in both English and Portuguese. It comprises 52 manually authored multi-entity questions, each with a manually validated checklist-style rubric to measure answer completeness and correctness. The benchmark assesses 14 different LLMs across two interaction settings: a Basic framework, which uses direct web search and scraping, and an Orchestrator framework, which enables task decomposition through delegated subagents. To account for stochasticity, each question is executed multiple times, and performance is reported with run-level uncertainty. Initial evaluations reveal significant performance disparities among models, demonstrate that orchestration frequently enhances coverage, and highlight considerable variability in cross-lingual transfer from English to Portuguese.
Key takeaway
If you develop or deploy LLMs for information retrieval, consider adding MARCA to your evaluation pipeline, especially when targeting multilingual applications. Its Portuguese coverage and its comparison of Basic and Orchestrator settings give insight into model reliability and cross-lingual transfer, helping you identify robust models and choose deployment strategies for different linguistic contexts.
Key insights
MARCA evaluates LLM web search reliability using bilingual questions and checklist rubrics.
Principles
- Multilingual web search is underexplored.
- Orchestration frequently improves LLM answer coverage.
Method
MARCA uses 52 multi-entity questions with checklist rubrics, evaluating 14 LLMs in Basic (direct search) and Orchestrator (subagent task decomposition) frameworks, with multiple runs to capture stochasticity.
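The scoring idea described here can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual implementation: it assumes each run of a question yields one boolean per checklist item, that a run's score is the fraction of items satisfied, and that run-level uncertainty is reported as the standard error of the mean. The function names `checklist_score` and `summarize_runs` are hypothetical, and the summary does not say how individual checklist items are judged.

```python
from math import sqrt
from statistics import mean, stdev


def checklist_score(satisfied: list[bool]) -> float:
    """Fraction of rubric items satisfied by one answer (assumed scoring rule)."""
    if not satisfied:
        raise ValueError("a rubric needs at least one checklist item")
    return sum(satisfied) / len(satisfied)


def summarize_runs(run_scores: list[float]) -> tuple[float, float]:
    """Mean score and standard error across repeated runs of one question."""
    avg = mean(run_scores)
    # The standard error is undefined for a single run; report 0.0 in that case.
    if len(run_scores) > 1:
        std_err = stdev(run_scores) / sqrt(len(run_scores))
    else:
        std_err = 0.0
    return avg, std_err


# Three hypothetical runs of the same question against a four-item rubric.
runs = [
    [True, True, False, True],
    [True, False, False, True],
    [True, True, True, True],
]
scores = [checklist_score(run) for run in runs]
mean_score, std_err = summarize_runs(scores)
print(f"score = {mean_score:.3f} +/- {std_err:.3f}")
```

Keeping the per-run scores, rather than only their average, is what makes it possible to tell a stable model apart from one that happens to score well on a single run.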
In practice
- Test LLMs for cross-lingual transfer.
- Consider orchestration for complex queries.
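One way to act on the first point is to compare a model's mean checklist score on the English and Portuguese versions of the same questions. The sketch below assumes per-question scores in the range 0 to 1 are already available for both languages. The helper `transfer_gap`, the model names, and every number are illustrative placeholders, not results from the benchmark.

```python
from statistics import mean


def transfer_gap(en_scores: list[float], pt_scores: list[float]) -> float:
    """Mean English score minus mean Portuguese score.

    A positive value means the model does worse in Portuguese.
    """
    return mean(en_scores) - mean(pt_scores)


# Placeholder data: hypothetical models with made-up per-question scores.
scores_by_model = {
    "model_a": {"en": [0.8, 0.6], "pt": [0.5, 0.5]},
    "model_b": {"en": [0.7, 0.7], "pt": [0.7, 0.6]},
}

gaps = {
    name: transfer_gap(langs["en"], langs["pt"])
    for name, langs in scores_by_model.items()
}

# List models from the largest English-to-Portuguese drop to the smallest.
for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: EN - PT gap = {gap:+.2f}")
```

The same comparison can be repeated with scores from the Basic and Orchestrator settings to check whether orchestration narrows or widens the gap for a given model.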
Topics
- MARCA Benchmark
- Multilingual Web Search
- Large Language Models
- Information Seeking
- Task Orchestration
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.