Causal_QA.PT: A Human–LLM Co-Curated Benchmark for Causal Question Answering in Portuguese Language
Summary
Causal_QA.PT is a new human–LLM co-curated benchmark designed for causal question answering in Portuguese, addressing a significant gap in high-quality evaluation resources for non-English causal reasoning. The dataset was developed using a hybrid human–LLM process involving targeted generation, validation, and evaluation, and is structured according to the PEARL causal typology. Researchers evaluated Large Language Models' ability to answer Portuguese causal questions and investigated the impact of explicitly including causal class information in prompt design. Results indicate that current LLMs can generate high-quality causal responses in Portuguese, with GPT-5 Mini showing strong performance in judgment-based evaluations. Providing explicit causal class information offered question- and model-dependent benefits, especially for interventional and counterfactual questions. The study also noted that human reference answers were not consistently superior, highlighting the need for meticulous benchmark curation.
Key takeaway
For research scientists developing or evaluating LLMs for non-English languages, Causal_QA.PT offers a resource for assessing causal reasoning in Portuguese. Consider adding the benchmark to your evaluation pipeline and testing whether stating the causal class in the prompt improves answers. The study found the benefit depends on the question and the model, and is most visible for interventional and counterfactual questions.
Key insights
A new human–LLM co-curated benchmark fills a gap in evaluation resources for LLM causal reasoning in Portuguese.
Principles
- Hybrid human-LLM curation enhances benchmark quality.
- Explicit causal class information in the prompt can help, but the benefit depends on the question type and the model.
- Human references are not always superior to LLM outputs.
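One way to check whether the class hint helps for a given model and question type is to compare judge scores with and without the hint, grouped by model and causal class. The sketch below is illustrative only: the record fields (`model`, `causal_class`, `hinted`, `score`), the function name, and the scores are hypothetical and are not taken from the benchmark or the paper's results.

```python
from collections import defaultdict
from statistics import mean


def class_hint_gain(records):
    """Mean score gain from the class hint, per (model, causal_class).

    Each record is a dict with hypothetical keys:
    model, causal_class, hinted (bool), score (float).
    """
    scores = defaultdict(lambda: {"plain": [], "hinted": []})
    for r in records:
        key = (r["model"], r["causal_class"])
        condition = "hinted" if r["hinted"] else "plain"
        scores[key][condition].append(r["score"])
    # Only report groups that were scored under both conditions.
    return {
        key: mean(s["hinted"]) - mean(s["plain"])
        for key, s in scores.items()
        if s["hinted"] and s["plain"]
    }


# Made-up scores purely for illustration, not results from the paper.
records = [
    {"model": "model-a", "causal_class": "counterfactual", "hinted": False, "score": 3.0},
    {"model": "model-a", "causal_class": "counterfactual", "hinted": True, "score": 4.0},
    {"model": "model-a", "causal_class": "associational", "hinted": False, "score": 4.0},
    {"model": "model-a", "causal_class": "associational", "hinted": True, "score": 4.0},
]
gains = class_hint_gain(records)
for (model, causal_class), gain in sorted(gains.items()):
    print(f"{model} / {causal_class}: {gain:+.2f}")
```

Reporting the gain per group rather than as a single average keeps question- and model-dependent effects visible instead of letting them cancel out.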
Method
The Causal_QA.PT benchmark was developed using a hybrid human–LLM process with targeted generation, validation, and evaluation procedures, organized by the PEARL causal typology.
In practice
- Use Causal_QA.PT for Portuguese causal QA evaluation.
- Consider GPT-5 Mini for Portuguese causal tasks; it performed strongly in the study's judgment-based evaluations.
- Experiment with explicit causal class prompting.
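Explicit causal class prompting can be tried with a small prompt builder that produces the same question with and without a class line. This is a minimal sketch, not the prompt format used in the study: the function name, the Portuguese prompt wording, and the label set are assumptions ("interventional" and "counterfactual" appear in the summary above, while "associational" is an assumed third label).

```python
from typing import Optional

# Assumed label set; the benchmark's actual class names may differ.
CAUSAL_CLASSES = {"associational", "interventional", "counterfactual"}


def build_prompt(question: str, causal_class: Optional[str] = None) -> str:
    """Build a Portuguese QA prompt, optionally stating the causal class."""
    if causal_class is not None and causal_class not in CAUSAL_CLASSES:
        raise ValueError(f"unknown causal class: {causal_class!r}")

    lines = ["Responda à seguinte pergunta causal em português."]
    if causal_class is not None:
        # The only difference between the two conditions is this line.
        lines.append(f"Classe causal da pergunta: {causal_class}.")
    lines.append(f"Pergunta: {question}")
    lines.append("Resposta:")
    return "\n".join(lines)


question = "O que aconteceria à inflação se o banco central subisse os juros?"
baseline = build_prompt(question)
with_class = build_prompt(question, "interventional")
print(with_class)
```

Keeping everything except the class line identical makes it easier to attribute any difference in answer quality to the class information itself.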
Topics
- Causal_QA.PT
- Causal Question Answering
- Portuguese Language
- Large Language Models
- PEARL Causal Typology
Best for: Research Scientist, AI Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.