Evaluating Reference-Free Summarization Quality Metrics for Portuguese: A Study with Human Judgments in Financial News
Summary
A study presented at PROPOR 2026 evaluates reference-free summarization quality metrics for Portuguese financial news, addressing the lack of reliable evaluation methods in this specialized domain. Researchers João Victor Assaoka Ribeiro, Thomas Pires Correia, José Vitor Souza Cardoso Requena, and Lilian Berton compared Question Answering (QA) based metrics against a direct LLM-as-a-Judge baseline. Their pipeline incorporated Lexical, Binary, and Semantic (LLM-based) QA scoring methods, validated against a human ground truth of 50 news items annotated for Faithfulness and Completeness. The findings indicate that granular QA metrics significantly outperform the monolithic LLM-Judge for evaluating Completeness, with QA-Binary achieving a rank correlation of ρ ≈ 0.49. For Faithfulness, the Semantic QA metric demonstrated a "super-human" ability to detect subtle hallucinations, such as temporal shifts, that human annotators missed.
Key takeaway
Research scientists developing or evaluating automatic summarization systems for specialized domains in languages such as Portuguese should prioritize granular Question Answering (QA) based metrics over monolithic LLM-as-a-Judge approaches. Specifically, use QA-Binary for completeness assessment and Semantic QA for detecting subtle faithfulness issues: in the study, QA-Binary tracked human completeness judgments better than the LLM-Judge baseline, and Semantic QA caught hallucinations in financial news that human annotators missed.
Key insights
Decomposing summarization evaluation into atomic QA pairs outperforms holistic LLM-as-a-Judge scoring for Portuguese financial news, most clearly when evaluating Completeness.
Principles
- Granular QA metrics improve completeness evaluation.
- Semantic QA detects subtle hallucinations.
- Human evaluation can have ceiling effects.
Method
The study proposes a pipeline comparing Lexical, Binary, and Semantic QA scoring methods against an LLM-as-a-Judge baseline, validated with human judgments on 50 Portuguese financial news summaries for faithfulness and completeness.
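The validation step, comparing each metric's scores with human judgments, can be sketched as a rank correlation. The sketch below assumes the reported ρ is Spearman's rank correlation, which this summary does not state explicitly; the function names and the four toy scores are illustrative and are not data from the study.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for constant input
    return cov / (var_x * var_y) ** 0.5


# Toy values only (NOT from the study): one metric score and one human
# rating per summary.
metric_scores = [0.2, 0.5, 0.9, 0.4]
human_ratings = [1, 3, 5, 2]
rho = spearman_rho(metric_scores, human_ratings)
```

Because the computation uses ranks only, the metric and the human scale do not need to share units; what matters is whether they order the summaries the same way.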
In practice
- Use QA-Binary for completeness evaluation.
- Employ Semantic QA to detect subtle hallucinations.
- Consider atomic QA pairs for specialized domains.
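As a rough illustration of the first recommendation, a binary QA completeness score can be sketched as the fraction of source-derived questions that a summary is able to answer. This is one plausible reading, not the paper's definition, which this summary does not give. `QAPair`, `qa_binary_completeness`, and `substring_judge` are hypothetical names, the substring check is only a stand-in for a yes/no judgment that a real pipeline would presumably obtain from a QA model or an LLM, and the Portuguese sentences are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class QAPair:
    """A question generated from the source article, with its expected answer."""
    question: str
    answer: str


def qa_binary_completeness(
    pairs: Sequence[QAPair],
    summary: str,
    is_answered: Callable[[QAPair, str], bool],
) -> float:
    """Fraction of QA pairs judged answerable from the summary (0.0 to 1.0)."""
    if not pairs:
        raise ValueError("at least one QA pair is required")
    hits = sum(1 for pair in pairs if is_answered(pair, summary))
    return hits / len(pairs)


def substring_judge(pair: QAPair, summary: str) -> bool:
    """Stand-in judge (assumption): yes if the expected answer appears verbatim."""
    return pair.answer.lower() in summary.lower()


# Invented toy example, not taken from the study's data.
pairs = [
    QAPair("Quem elevou a taxa de juros?", "banco central"),
    QAPair("Para quanto foi a taxa?", "10,5%"),
    QAPair("Quando ocorreu a alta?", "maio"),
    QAPair("Como o mercado reagiu?", "queda da bolsa"),
]
summary = "O banco central elevou a taxa de juros para 10,5%."
score = qa_binary_completeness(pairs, summary, substring_judge)
```

Keeping the judge as a pluggable callable means the same scoring loop works whether each per-question decision comes from string matching, a QA model, or an LLM prompt.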
Topics
- Reference-Free Summarization
- Summarization Evaluation Metrics
- LLM-as-a-Judge
- Question Answering Metrics
- Portuguese Financial News
Best for: Research Scientist, AI Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.