MARCA: A Checklist-Based Benchmark for Multilingual Web Search

· Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

MARCA is a new bilingual benchmark designed to evaluate large language models (LLMs) on web-based information seeking in both English and Portuguese. It comprises 52 manually authored multi-entity questions, each with a manually validated checklist-style rubric to measure answer completeness and correctness. The benchmark assesses 14 different LLMs across two interaction settings: a Basic framework, which uses direct web search and scraping, and an Orchestrator framework, which enables task decomposition through delegated subagents. To account for stochasticity, each question is executed multiple times, and performance is reported with run-level uncertainty. Initial evaluations reveal significant performance disparities among models, demonstrate that orchestration frequently enhances coverage, and highlight considerable variability in cross-lingual transfer from English to Portuguese.

Key takeaway

Research scientists developing or deploying LLMs for information retrieval should consider adding MARCA to their evaluation pipeline, especially when targeting multilingual applications. Its Portuguese coverage and its comparison of the Basic and Orchestrator frameworks expose differences in model reliability and cross-lingual transfer, which helps in selecting robust models and choosing deployment strategies for non-English contexts.

Key insights

MARCA evaluates LLM web search reliability using bilingual questions and checklist rubrics.

Method

MARCA uses 52 multi-entity questions with checklist rubrics, evaluating 14 LLMs in Basic (direct search) and Orchestrator (subagent task decomposition) frameworks, with multiple runs to capture stochasticity.
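
The checklist-plus-repeated-runs idea can be illustrated with a minimal sketch. It assumes a per-run score equal to the fraction of rubric items satisfied, and it uses the standard error of the mean as the run-level uncertainty. The summary does not state MARCA's exact scoring formula or uncertainty measure, so both choices, along with the function names and the sample data, are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def checklist_score(satisfied: list[bool]) -> float:
    """Fraction of rubric items judged satisfied in a single run (assumed metric)."""
    if not satisfied:
        raise ValueError("rubric must contain at least one item")
    return sum(satisfied) / len(satisfied)


def summarize_runs(runs: list[list[bool]]) -> tuple[float, float]:
    """Mean score and standard error over repeated runs of one question."""
    if not runs:
        raise ValueError("at least one run is required")
    scores = [checklist_score(run) for run in runs]
    if len(scores) < 2:
        # A single run gives no estimate of run-to-run spread.
        return scores[0], 0.0
    return mean(scores), stdev(scores) / sqrt(len(scores))


# Hypothetical data: three runs of one question with a four-item rubric.
runs = [
    [True, True, False, True],
    [True, False, False, True],
    [True, True, True, True],
]
avg, sem = summarize_runs(runs)
print(f"coverage = {avg:.2f} +/- {sem:.2f}")  # coverage = 0.75 +/- 0.14
```

Running the same summary separately for the Basic and Orchestrator settings, or for the English and Portuguese versions of a question, would give comparable per-setting numbers under these assumptions.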

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.