Causal_QA.PT: A Human–LLM Co-Curated Benchmark for Causal Question Answering in Portuguese Language

2026-04-12 · Source: Paper Index on ACL Anthology · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, quick

Summary

Causal_QA.PT is a new human–LLM co-curated benchmark designed for causal question answering in Portuguese, addressing a significant gap in high-quality evaluation resources for non-English causal reasoning. The dataset was developed using a hybrid human–LLM process involving targeted generation, validation, and evaluation, and is structured according to the PEARL causal typology. Researchers evaluated Large Language Models' ability to answer Portuguese causal questions and investigated the impact of explicitly including causal class information in prompt design. Results indicate that current LLMs can generate high-quality causal responses in Portuguese, with GPT-5 Mini showing strong performance in judgment-based evaluations. Providing explicit causal class information offered question- and model-dependent benefits, especially for interventional and counterfactual questions. The study also noted that human reference answers were not consistently superior, highlighting the need for meticulous benchmark curation.

Key takeaway

For research scientists developing or evaluating LLMs for non-English languages, Causal_QA.PT offers a dedicated resource for assessing causal reasoning in Portuguese. Consider integrating it into your evaluation pipelines and testing whether explicitly stating the causal class in your prompts helps, particularly for interventional and counterfactual questions, where the reported benefits depended on both the question and the model.

Key insights

A new human–LLM co-curated benchmark enables evaluation of LLM causal reasoning in Portuguese.

Principles

Hybrid human–LLM curation enhances benchmark quality.
Explicit causal class information can improve LLM performance, depending on the question type and the model.
Human references are not always superior to LLM outputs.

Method

The Causal_QA.PT benchmark was developed using a hybrid human–LLM process with targeted generation, validation, and evaluation procedures, organized by the PEARL causal typology.

In practice

Use Causal_QA.PT for Portuguese causal QA evaluation.
Consider GPT-5 Mini for Portuguese causal tasks; it showed strong performance in judgment-based evaluations.
Experiment with explicit causal class prompting.
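
A minimal sketch of how explicit causal class prompting might be tried, shown as an illustration rather than the authors' actual prompt design. The class labels, the `CausalQuestion` record, the `build_prompt` helper, and the Portuguese prompt wording are all assumptions; the summary names only interventional and counterfactual questions, and "associational" is added here as the remaining rung of Pearl's causal hierarchy.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed labels: only "interventional" and "counterfactual" appear in the
# summary; "associational" is added as the remaining rung of Pearl's hierarchy.
CAUSAL_CLASSES = ("associational", "interventional", "counterfactual")


@dataclass
class CausalQuestion:
    """Hypothetical record for one benchmark item (field names are assumptions)."""
    text: str                           # the question, in Portuguese
    causal_class: Optional[str] = None  # one of CAUSAL_CLASSES, if known


def build_prompt(question: CausalQuestion, include_class: bool) -> str:
    """Build a prompt, optionally stating the question's causal class."""
    lines = ["Responda à seguinte pergunta causal em português."]
    if include_class and question.causal_class is not None:
        if question.causal_class not in CAUSAL_CLASSES:
            raise ValueError(f"unknown causal class: {question.causal_class}")
        lines.append(f"Classe causal: {question.causal_class}")
    lines.append(f"Pergunta: {question.text}")
    lines.append("Resposta:")
    return "\n".join(lines)


# Illustrative question written for this sketch, not taken from the benchmark.
q = CausalQuestion(
    text="O que teria acontecido com a colheita se não tivesse chovido?",
    causal_class="counterfactual",
)
baseline_prompt = build_prompt(q, include_class=False)
class_aware_prompt = build_prompt(q, include_class=True)
```

Sending both prompts to the same model and comparing the answers per causal class gives a simple A/B check of whether the extra class line helps for that model and question type.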

Topics

Causal_QA.PT
Causal Question Answering
Portuguese Language
Large Language Models
PEARL Causal Typology

Best for: Research Scientist, AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.