The Inadequacy of Automatic Evaluation Metrics in Question Answering: A Case-Study in Portuguese

2026-04-12 · Source: Paper Index on ACL Anthology · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

This study examines the limitations of automatic evaluation metrics for Question Answering (QA) systems, using Portuguese as the case study. The researchers compared traditional lexical metrics (BLEU, ROUGE, METEOR) with newer approaches, including the "LLM-as-a-judge" paradigm. Experiments used Pirá, a Portuguese QA dataset, with four Large Language Models (LLMs) generating answers. Human evaluators rated these answers on correctness, completeness, clarity, and relevance. The findings indicate that lexical metrics are inadequate for QA evaluation: they often penalize longer answers as verbose, while human evaluators perceive the same answers as carrying more information. This divergence suggests that traditional metrics fail to capture the balance between instruction adherence and semantic richness valued by native Portuguese speakers.

Key takeaway

For AI engineers developing or deploying Question Answering systems, especially in languages such as Portuguese, relying solely on traditional lexical metrics (BLEU, ROUGE, METEOR) is insufficient. Integrate human evaluation or LLM-as-a-judge methods to assess answer quality, particularly semantic richness and information density, which lexical metrics often misread as undesirable verbosity. This helps ensure your models meet user expectations for comprehensive and relevant responses.

Key insights

Traditional lexical metrics are insufficient for evaluating Question Answering quality, as this Portuguese case study shows.

Principles

Lexical metrics penalize verbosity.
Human evaluators favor information density over strict instruction adherence.
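
The verbosity penalty can be illustrated with a simplified unigram-overlap score. This is a minimal sketch in the spirit of ROUGE-1, not the BLEU, ROUGE, or METEOR implementations used in the study. The function name `unigram_overlap` and the example sentences are made up for illustration and do not come from the Pirá dataset.

```python
from collections import Counter


def unigram_overlap(candidate: str, reference: str) -> dict:
    """Simplified ROUGE-1-style precision, recall and F1 on whitespace tokens."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection: each reference token can be matched at most once.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(cand_tokens) if cand_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative sentences only (assumption, not dataset content).
reference = "o oceano atlântico"
short_answer = "o oceano atlântico"
verbose_answer = "o oceano atlântico banha toda a costa leste do brasil"

short_scores = unigram_overlap(short_answer, reference)
verbose_scores = unigram_overlap(verbose_answer, reference)
print(short_scores)
print(verbose_scores)
```

Both answers contain every reference token, so recall is 1.0 in both cases. The extra words in the longer answer lower its precision to 0.3 and pull its F1 down, even though a human reader might rate it as the more informative answer.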

Method

The study compared traditional metrics and LLM-as-a-judge evaluation on the Portuguese Pirá dataset, with answers generated by four LLMs and rated by human evaluators on correctness, completeness, clarity, and relevance.
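
One common way to compare an automatic metric with human ratings is rank correlation. The summary does not say which statistic the authors used, so the following is a sketch of Spearman correlation under that assumption, written with the standard library only. The helper names and the placeholder numbers are hypothetical, not results from the paper.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder values for illustration only (not measurements from the study).
metric_scores = [0.82, 0.41, 0.65, 0.30]
human_ratings = [3, 5, 4, 4]
print(spearman(metric_scores, human_ratings))
```

A value near 1 would mean the metric orders answers the way the human raters do; a value near 0 or below would indicate the kind of divergence the study reports for lexical metrics.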

In practice

Prioritize human evaluation for nuanced QA.
Consider LLM-as-a-judge for initial QA screening.
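
An initial LLM-as-a-judge screening step might look like the sketch below. It borrows the four criteria named in the study, but the prompt wording, the 1 to 5 scale, the JSON reply format, and the review threshold are all assumptions rather than the authors' protocol. The `judge` argument is a hypothetical callable that wraps whichever model you use; `stub_judge` stands in for it here so the example runs without any API.

```python
import json
from typing import Callable

# Criteria taken from the study's human evaluation.
CRITERIA = ("correctness", "completeness", "clarity", "relevance")


def build_judge_prompt(question: str, reference: str, answer: str) -> str:
    """Assemble a judging prompt (wording is an assumption, not the paper's)."""
    criteria_list = ", ".join(CRITERIA)
    return (
        "Rate the candidate answer from 1 (worst) to 5 (best) on each of: "
        f"{criteria_list}.\n"
        "Reply with a JSON object that uses exactly those keys.\n\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {answer}\n"
    )


def parse_scores(raw: str) -> dict:
    """Parse the judge's JSON reply and validate the assumed 1-5 scale."""
    data = json.loads(raw)
    scores = {}
    for name in CRITERIA:
        value = int(data[name])
        if not 1 <= value <= 5:
            raise ValueError(f"{name} score out of range: {value}")
        scores[name] = value
    return scores


def screen(question: str, reference: str, answer: str,
           judge: Callable[[str], str], threshold: int = 3):
    """Return (scores, needs_human_review) for one answer."""
    prompt = build_judge_prompt(question, reference, answer)
    scores = parse_scores(judge(prompt))
    # Send the answer to human review if any criterion falls below threshold.
    needs_human_review = min(scores.values()) < threshold
    return scores, needs_human_review


def stub_judge(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; returns a fixed reply."""
    return '{"correctness": 5, "completeness": 4, "clarity": 2, "relevance": 5}'


scores, needs_review = screen("pergunta", "resposta de referência",
                              "resposta candidata", stub_judge)
print(scores, needs_review)
```

Passing the judge in as a callable keeps the screening logic independent of any particular model provider, and answers flagged by the threshold can be routed to the human evaluation the study recommends for nuanced cases.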

Topics

Question Answering
Automatic Evaluation Metrics
LLM-as-a-judge
Human Evaluation
Pirá Dataset

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.