The Inadequacy of Automatic Evaluation Metrics in Question Answering: A Case-Study in Portuguese
Summary
A study investigated the limitations of automatic evaluation metrics for Question Answering (QA) systems, particularly in Portuguese. Researchers conducted a comparative analysis of traditional metrics like BLEU, ROUGE, and METEOR against newer approaches, including the "LLM-as-a-judge" paradigm. Experiments utilized the Pirá dataset, a Portuguese QA dataset, with four different Large Language Models (LLMs) generating answers. Human evaluators assessed these answers based on correctness, completeness, clarity, and relevance. The findings indicate that lexical metrics are inadequate for QA evaluation, often penalizing verbosity that human evaluators perceive as higher information density. This divergence highlights that traditional metrics fail to capture the balance between instruction adherence and semantic richness valued by native Portuguese speakers.
Key takeaway
For AI Engineers developing or deploying Question Answering systems, especially in languages like Portuguese, relying solely on traditional lexical metrics (BLEU, ROUGE, METEOR) is insufficient. You should integrate human evaluation or more advanced LLM-as-a-judge methods to accurately assess answer quality, particularly regarding semantic richness and information density, which traditional metrics often misinterpret as undesirable verbosity. This ensures your models meet user expectations for comprehensive and relevant responses.
Key insights
Traditional lexical metrics are insufficient for evaluating Question Answering quality, especially in non-English languages.
Principles
- Lexical metrics penalize verbosity, even when the extra text adds relevant information.
- Human evaluators favor information density over strict instruction adherence.
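To illustrate the first principle, here is a minimal sketch of a unigram-overlap score in the spirit of ROUGE-1. It is not the metric implementation used in the study. The Portuguese sentences are invented for illustration and are not taken from the Pirá dataset, and the function name `unigram_overlap` is hypothetical.

```python
from collections import Counter


def unigram_overlap(reference: str, candidate: str) -> dict:
    """Whitespace-tokenized unigram precision, recall, and F1 (ROUGE-1 style)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref & cand).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Invented example: both candidates contain the full reference answer.
reference = "no oceano atlântico"
terse = "no oceano atlântico"
verbose = (
    "no oceano atlântico ao longo da costa brasileira "
    "onde a produção de petróleo é maior"
)

terse_scores = unigram_overlap(reference, terse)
verbose_scores = unigram_overlap(reference, verbose)
print(terse_scores)
print(verbose_scores)
```

Both candidates recover every reference token, so recall is identical. The longer answer loses precision, and therefore F1, purely because of the extra words, whether or not a reader would find them informative.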
Method
The study compared traditional and LLM-as-a-judge QA evaluation methods on the Portuguese Pirá dataset, using four LLMs and human assessment for correctness, completeness, clarity, and relevance.
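One common way to compare an automatic metric against human assessment is rank correlation. The summary does not say which statistic the study used, so the sketch below is an assumption: a plain-Python Spearman correlation with average ranks for ties. The score lists are placeholder values for illustration only, not results from the study.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side is constant
    return cov / (var_x * var_y) ** 0.5


# Placeholder values, NOT data from the study.
metric_scores = [0.90, 0.35, 0.60, 0.20]
human_ratings = [3, 5, 4, 4]
print(spearman(metric_scores, human_ratings))
```

A correlation near zero or below zero on real data would suggest that the metric ranks answers differently from human judges, which is the kind of divergence the study reports for lexical metrics.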
In practice
- Prioritize human evaluation for nuanced QA.
- Consider LLM-as-a-judge for initial QA screening.
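An LLM-as-a-judge screening step could be sketched as follows. This is an assumption-laden outline, not the prompt or protocol from the study: the four criteria come from the summary above, but the 1 to 5 scale, the JSON reply format, and the names `build_judge_prompt` and `parse_judge_reply` are hypothetical. The call to an actual model is deliberately left out.

```python
import json

# The four criteria named in the summary.
CRITERIA = ("correctness", "completeness", "clarity", "relevance")


def build_judge_prompt(question: str, reference: str, answer: str,
                       scale_max: int = 5) -> str:
    """Assemble a grading prompt; the 1..scale_max scale is an assumption."""
    criteria = ", ".join(CRITERIA)
    return (
        "You are grading an answer to a question written in Portuguese.\n"
        f"Rate the answer from 1 to {scale_max} on each of: {criteria}.\n"
        "Reply with a single JSON object whose keys are the criterion names.\n\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {answer}\n"
    )


def parse_judge_reply(reply: str, scale_max: int = 5) -> dict:
    """Validate the judge's JSON reply and return one integer per criterion."""
    data = json.loads(reply)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    scores = {}
    for name in CRITERIA:
        value = data.get(name)
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"missing or non-integer score for {name}")
        if not 1 <= value <= scale_max:
            raise ValueError(f"score for {name} out of range: {value}")
        scores[name] = value
    return scores


prompt = build_judge_prompt("Onde fica o pré-sal?", "no oceano atlântico",
                            "no oceano atlântico, ao longo da costa brasileira")
example_reply = '{"correctness": 5, "completeness": 4, "clarity": 5, "relevance": 5}'
print(parse_judge_reply(example_reply))
```

Strict parsing matters for screening: a reply that omits a criterion or drifts outside the scale is rejected rather than silently scored, so malformed judgments can be routed to human evaluators.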
Topics
- Question Answering
- Automatic Evaluation Metrics
- LLM-as-a-judge
- Human Evaluation
- Pirá Dataset
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.