Evaluating Reference-Free Summarization Quality Metrics for Portuguese: A Study with Human Judgments in Financial News

2026-04-12 · Source: Paper Index on ACL Anthology · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, medium

Summary

A study presented at PROPOR 2026 evaluates reference-free summarization quality metrics for Portuguese financial news, addressing the lack of reliable evaluation methods in this specialized domain. Researchers João Victor Assaoka Ribeiro, Thomas Pires Correia, José Vitor Souza Cardoso Requena, and Lilian Berton compared Question Answering (QA) based metrics against a direct LLM-as-a-Judge baseline. Their pipeline incorporated Lexical, Binary, and Semantic (LLM-based) QA scoring methods, validated against a human ground truth of 50 news items annotated for Faithfulness and Completeness. The findings indicate that granular QA metrics significantly outperform the monolithic LLM-Judge for evaluating Completeness, with QA-Binary achieving a rank correlation of ρ ≈ 0.49. For Faithfulness, the Semantic QA metric demonstrated a "super-human" ability to detect subtle hallucinations, such as temporal shifts, that human annotators missed.

Key takeaway

Research scientists developing or evaluating automatic summarization systems for specialized domains in languages like Portuguese should prioritize granular Question Answering (QA) based metrics over monolithic LLM-as-a-Judge approaches. Specifically, use QA-Binary for completeness assessment and Semantic QA for detecting subtle faithfulness issues. In this study, QA-Binary correlated better with human judgments of Completeness than the LLM-Judge did, and Semantic QA flagged hallucinations in financial news that human annotators missed.

Key insights

Decomposing summarization evaluation into atomic QA pairs surpasses holistic LLM-as-a-Judge methods for Portuguese financial news.

Principles

Granular QA metrics improve completeness evaluation.
Semantic QA detects subtle hallucinations.
Human evaluation can have ceiling effects.

Method

The study proposes a pipeline comparing Lexical, Binary, and Semantic QA scoring methods against an LLM-as-a-Judge baseline, validated with human judgments on 50 Portuguese financial news summaries for faithfulness and completeness.
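
To make the difference between the scoring methods concrete, here is a minimal sketch of how a Lexical and a Binary score could be computed over QA pairs. This summary does not give the paper's exact definitions, so everything here is an assumption for illustration: the Lexical scorer is taken to be token-overlap F1, the Binary scorer is a thresholded 0/1 decision on that overlap, and the names tokenize, lexical_f1, binary_match, qa_score, and the 0.6 threshold are hypothetical. The Semantic (LLM-based) scorer is left out because it needs a model call.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; \\w is Unicode-aware, so accented Portuguese works."""
    return re.findall(r"\w+", text.lower())


def lexical_f1(predicted, gold):
    """Token-overlap F1 between two answers (assumed form of a lexical scorer)."""
    pred_tokens, gold_tokens = tokenize(predicted), tokenize(gold)
    if not pred_tokens or not gold_tokens:
        # Both empty counts as agreement, exactly one empty as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def binary_match(predicted, gold, threshold=0.6):
    """1.0 if the answers overlap enough, else 0.0 (hypothetical threshold)."""
    return 1.0 if lexical_f1(predicted, gold) >= threshold else 0.0


def qa_score(qa_pairs, scorer):
    """Mean scorer value over (answer_from_summary, answer_from_source) pairs."""
    if not qa_pairs:
        return 0.0
    return sum(scorer(pred, gold) for pred, gold in qa_pairs) / len(qa_pairs)


# Invented example answers, not data from the study.
pairs = [
    ("O Banco Central", "Banco Central"),  # same entity, extra article
    ("em março", "em abril"),              # temporal shift
    ("", "R$ 2 bilhões"),                  # summary cannot answer
]

lexical_total = qa_score(pairs, lexical_f1)
binary_total = qa_score(pairs, binary_match)
```

In this toy example the lexical scorer still gives partial credit to the shifted month because the two answers share a token. That is the kind of subtle error a semantic scorer is meant to catch.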

In practice

Use QA-Binary for completeness evaluation.
Employ Semantic QA to detect subtle hallucinations.
Consider atomic QA pairs for specialized domains.
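
Before adopting any of these metrics, it helps to check how well its scores rank summaries compared with human ratings, which is what the reported rank correlation measures. The following is a minimal sketch of Spearman's ρ in plain Python, computed as the Pearson correlation of average ranks so that ties are handled. The function names are hypothetical and this is not the authors' evaluation code.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; NaN when either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return math.nan
    return cov / math.sqrt(var_x * var_y)


def spearman(metric_scores, human_scores):
    """Spearman rank correlation between metric scores and human ratings."""
    if len(metric_scores) != len(human_scores) or len(metric_scores) < 2:
        raise ValueError("need two equally long lists with at least 2 items")
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))
```

If annotators give every summary the same top rating, the human ranks have no variance and this sketch returns NaN. That is one way a ceiling effect in human evaluation shows up: the correlation stops being informative even if the metric itself is sound.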

Topics

Reference-Free Summarization
Summarization Evaluation Metrics
LLM-as-a-Judge
Question Answering Metrics
Portuguese Financial News

Best for: Research Scientist, AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.