MADE: Beyond Scoring via a Multilingual Agentic Diagnosing Engine for Fine-Grained Evaluation Insights

2026-06-08 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

MADE, a Multilingual Agentic Diagnosing Engine, addresses the challenge of extracting actionable insights from complex multilingual Large Language Model (LLM) benchmarks, which often report scores without explaining what drives them. Developed at Huawei (China), MADE decomposes post-evaluation analysis into five specialized agent roles: Planner, Evidence Analyst, Case Analyst, Language Reflector, and Reporter. The engine was evaluated on a diagnostic set of 54 queries across 15 languages, applied to a large-scale multilingual evaluation substrate comprising 33 model families, 11 benchmarks, 26 languages, 34 cultures, and 8.66 million evaluation records. In these experiments, MADE improved diagnosis report quality, outperforming the strongest shared baseline by 47%, and human multilingual experts preferred its reports in 87.9% of pairwise comparisons. The system also surfaced four actionable findings on LLM deployment, iteration, and cross-cultural performance issues.

Key takeaway

For MLOps engineers and AI scientists deploying multilingual LLMs, aggregate benchmark scores alone are insufficient. Adopt a fine-grained diagnostic approach like MADE's agentic workflow to uncover specific language, cultural, or task-level failure modes. This shifts the focus from "where" models rank to "why" they fail, enabling targeted remediation and informed model selection for diverse global users.

Key insights

Multilingual LLM evaluation needs fine-grained, agentic diagnosis to turn scores into actionable insights that reflect cultural and deployment realities.

Principles

Decompose complex analysis into specialized agent roles.
Ground all diagnostic claims in verifiable evidence.
Explicitly integrate multilingual and cultural reflection.

Method

MADE employs a five-role agentic workflow: Planner, Evidence Analyst, Case Analyst, Language Reflector, and Reporter. It uses deterministic diagnostic tools and a structured 3-dimensional query taxonomy to generate grounded reports.
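
The five-role workflow can be pictured as a sequential pipeline over a shared state. The sketch below is a minimal illustration under stated assumptions, not MADE's implementation: the role names come from the summary, but every function, field name, and the record layout (`language`, `correct`) is hypothetical, and each role is a plain Python function standing in for what would presumably be an LLM-backed agent.

```python
from dataclasses import dataclass, field


@dataclass
class DiagnosisState:
    """Shared state passed from role to role (hypothetical layout)."""
    query: str
    plan: list = field(default_factory=list)
    evidence: dict = field(default_factory=dict)
    cases: list = field(default_factory=list)
    language_notes: list = field(default_factory=list)
    report: str = ""


def accuracy_by_language(records):
    """Deterministic tool: the same records always yield the same numbers."""
    totals, correct = {}, {}
    for r in records:
        lang = r["language"]
        totals[lang] = totals.get(lang, 0) + 1
        correct[lang] = correct.get(lang, 0) + int(r["correct"])
    return {lang: correct[lang] / totals[lang] for lang in sorted(totals)}


def planner(state, records):
    # Assumption: the plan is a fixed list of steps; a real planner would adapt it to the query.
    state.plan = ["per-language accuracy", "sample failures", "language check", "report"]
    return state


def evidence_analyst(state, records):
    # Claims are grounded in tool output rather than free-form model text.
    state.evidence = accuracy_by_language(records)
    return state


def case_analyst(state, records):
    # Keep a few failing records for qualitative inspection.
    state.cases = [r for r in records if not r["correct"]][:3]
    return state


def language_reflector(state, records):
    # Point at the weakest language so the report does not generalize from English alone.
    if state.evidence:
        worst = min(state.evidence, key=state.evidence.get)
        state.language_notes.append(
            f"Lowest accuracy: {worst} ({state.evidence[worst]:.2f})"
        )
    return state


def reporter(state, records):
    lines = [f"Query: {state.query}"]
    lines += [f"- {lang}: {acc:.2f}" for lang, acc in state.evidence.items()]
    lines.append(f"Failure cases reviewed: {len(state.cases)}")
    lines += state.language_notes
    state.report = "\n".join(lines)
    return state


PIPELINE = [planner, evidence_analyst, case_analyst, language_reflector, reporter]


def diagnose(query, records):
    """Run the five roles in order over a shared state."""
    state = DiagnosisState(query=query)
    for role in PIPELINE:
        state = role(state, records)
    return state


# Toy records, invented purely for illustration.
RECORDS = [
    {"language": "en", "correct": True},
    {"language": "en", "correct": True},
    {"language": "sw", "correct": False},
    {"language": "sw", "correct": True},
]

if __name__ == "__main__":
    print(diagnose("Why does the model lag in Swahili?", RECORDS).report)
```

Keeping the numeric step in a pure function means any figure in the final report can be traced back to the records that produced it, which is the property the summary attributes to deterministic diagnostic tools.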

In practice

Use role-specialized agents for complex LLM evaluation.
Implement deterministic tools for verifiable claims.
Audit for English-centric assumptions in multilingual contexts.
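
One way to make the audit for English-centric assumptions concrete is a small deterministic check that measures how far each language trails English. This is a hedged sketch, not a tool described in the source: the function names, the 0.10 threshold, and the accuracy figures are all assumptions chosen for illustration.

```python
def english_gap(accuracy, pivot="en"):
    """Return each language's accuracy shortfall relative to the pivot, largest first."""
    if pivot not in accuracy:
        raise KeyError(f"pivot language {pivot!r} missing from accuracy table")
    base = accuracy[pivot]
    gaps = {lang: base - acc for lang, acc in accuracy.items() if lang != pivot}
    return dict(sorted(gaps.items(), key=lambda kv: kv[1], reverse=True))


def flag_languages(accuracy, threshold=0.10, pivot="en"):
    """List languages whose shortfall exceeds the (hypothetical) threshold."""
    return [lang for lang, gap in english_gap(accuracy, pivot).items() if gap > threshold]


# Invented per-language accuracies for one model on one benchmark.
ACCURACY = {"en": 0.80, "de": 0.75, "sw": 0.55, "th": 0.60}

if __name__ == "__main__":
    for lang in flag_languages(ACCURACY):
        print(f"{lang}: trails English by {english_gap(ACCURACY)[lang]:.2f}")
```

A gap table like this only shows where performance diverges from English; explaining why still requires the case-level and cultural analysis the workflow assigns to its other roles.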

Topics

Multilingual LLMs
LLM Evaluation
Agentic AI Systems
Cross-Cultural AI
Diagnostic Tools
MLOps

Best for: Research Scientist, NLP Engineer, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.