MADE: Beyond Scoring via a Multilingual Agentic Diagnosing Engine for Fine-Grained Evaluation Insights
Summary
MADE, a Multilingual Agentic Diagnosing Engine, addresses the challenge of extracting actionable insights from complex multilingual Large Language Model (LLM) benchmarks, which often report scores without explaining what drives them. Developed by Huawei, China, MADE decomposes post-evaluation analysis into five specialized agent roles: Planner, Evidence Analyst, Case Analyst, Language Reflector, and Reporter. The engine was evaluated on a 54-query by 15-language diagnostic set, built on a large-scale multilingual evaluation substrate comprising 33 model families, 11 benchmarks, 26 languages, 34 cultures, and 8.66 million evaluation records. In these experiments, MADE improved diagnosis report quality, outperforming the strongest shared baseline by 47% and being preferred by human multilingual experts in 87.9% of pairwise comparisons. The system also surfaced four actionable findings on LLM deployment, iteration, and cross-cultural performance issues.
Key takeaway
For MLOps Engineers or AI Scientists deploying multilingual LLMs, relying solely on aggregate benchmark scores is insufficient. You should adopt a fine-grained diagnostic approach like MADE's agentic workflow to uncover specific language, cultural, or task-level failure modes. This shifts your focus from "where" models rank to "why" they fail, enabling targeted remediation and informed model selection for diverse global users.
Key insights
Multilingual LLM evaluation requires fine-grained, agentic diagnosis to turn scores into actionable insights that account for cultural context and real deployment conditions.
Principles
- Decompose complex analysis into specialized agent roles.
- Ground all diagnostic claims in verifiable evidence.
- Explicitly integrate multilingual and cultural reflection.
Method
MADE employs a five-role agentic workflow: Planner, Evidence Analyst, Case Analyst, Language Reflector, and Reporter. It combines deterministic diagnostic tools with a structured three-dimensional query taxonomy to generate reports grounded in evidence.
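To make the role decomposition concrete, here is a minimal sketch of how five role-specialized agents might pass a shared state along. Only the role names come from the summary. The strictly sequential hand-off, the `DiagnosisState` class, the stub agents, and `run_pipeline` are all assumptions made for illustration, not MADE's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Role names are taken from the summary; the ordering as a strict
# sequence is an assumption for this sketch.
ROLES = ["Planner", "Evidence Analyst", "Case Analyst",
         "Language Reflector", "Reporter"]


@dataclass
class DiagnosisState:
    """Hypothetical shared state handed from one role to the next."""
    query: str
    language: str
    notes: Dict[str, str] = field(default_factory=dict)
    trace: List[str] = field(default_factory=list)


Agent = Callable[[DiagnosisState], str]


def make_stub_agent(role: str) -> Agent:
    """Return a placeholder agent; a real one would call an LLM or a tool."""
    def agent(state: DiagnosisState) -> str:
        seen = ", ".join(state.trace) or "nothing yet"
        return f"[{role}] handled '{state.query}' ({state.language}) after: {seen}"
    return agent


def run_pipeline(query: str, language: str,
                 agents: Dict[str, Agent]) -> DiagnosisState:
    """Run each role once, in order, recording its output in the state."""
    state = DiagnosisState(query=query, language=language)
    for role in ROLES:
        state.notes[role] = agents[role](state)
        state.trace.append(role)
    return state


agents = {role: make_stub_agent(role) for role in ROLES}
final = run_pipeline("Why does model A lag on Swahili QA?", "sw", agents)
print(final.notes["Reporter"])
```

Keeping each role behind the same small callable interface means a stub can be swapped for an LLM-backed agent one role at a time, while the recorded trace shows which roles contributed to a given report.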
In practice
- Use role-specialized agents for complex LLM evaluation.
- Implement deterministic tools for verifiable claims.
- Audit for English-centric assumptions in multilingual contexts.
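As an illustration of the last two points, the sketch below shows one possible deterministic tool: it computes per-language accuracy from raw evaluation records and the gap to a reference language, so that a claim like "the model is weaker on Swahili" can be checked against numbers. The record schema, function names, and sample data are hypothetical and are not taken from MADE.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record schema: (model, language, is_correct).
Record = Tuple[str, str, bool]


def accuracy_by_language(records: Iterable[Record], model: str) -> Dict[str, float]:
    """Compute accuracy per language for one model, with no randomness involved."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for rec_model, lang, ok in records:
        if rec_model != model:
            continue
        totals[lang] += 1
        correct[lang] += int(ok)
    return {lang: correct[lang] / totals[lang] for lang in sorted(totals)}


def gap_to_reference(acc: Dict[str, float], reference: str = "en") -> Dict[str, float]:
    """Return reference accuracy minus each other language's accuracy."""
    ref = acc[reference]
    return {lang: round(ref - a, 4) for lang, a in acc.items() if lang != reference}


# Made-up sample records, for illustration only.
records = [
    ("model-a", "en", True), ("model-a", "en", True),
    ("model-a", "en", True), ("model-a", "en", False),
    ("model-a", "sw", True), ("model-a", "sw", False),
    ("model-a", "sw", False), ("model-a", "sw", False),
    ("model-b", "en", True),
]

acc = accuracy_by_language(records, "model-a")
gaps = gap_to_reference(acc)
print(acc)
print(gaps)
```

Because the tool is a pure function of the records, the same inputs always give the same numbers, which lets a report cite them as evidence. Making the reference language an explicit parameter also exposes the English-centric default so it can be questioned rather than assumed.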
Topics
- Multilingual LLMs
- LLM Evaluation
- Agentic AI Systems
- Cross-Cultural AI
- Diagnostic Tools
- MLOps
Best for: Research Scientist, NLP Engineer, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.