Textual Inference in Portuguese: Comparing Language Models
Summary
This preliminary investigation explores the development of FraCaS-BR, a Portuguese adaptation of the FraCaS benchmark for semantic inference, as a way to evaluate the logic-sensitive semantic reasoning of large language models (LLMs) outside English. The researchers tested ChatGPT, Maritalk, and Evaristo on a small diagnostic subset of seven FraCaS problems translated into Brazilian Portuguese, covering generalized quantifiers, plurals, and nominal anaphora. Each problem was submitted multiple times, and the answers were assessed for correctness, variance, and consistency against the original FraCaS gold labels. The study, presented at PROPOR 2026, found systematic differences among the models, with ChatGPT showing higher overall correctness and stability. All three models nonetheless showed limitations on logic-controlled inference tasks, and human intervention was needed during both translation and evaluation. These findings motivate the continued development of FraCaS-BR as a controlled resource for assessing semantic reasoning in Portuguese.
Key takeaway
Research scientists developing or evaluating LLMs for non-English natural language inference should prioritize building and using logic-controlled, language-specific benchmarks such as FraCaS-BR. Evaluations should account for systematic differences between models and include human-in-the-loop validation, both to keep results reliable and to catch limitations in semantic reasoning, especially on complex linguistic phenomena.
Key insights
The three LLMs tested showed limitations in logic-sensitive semantic reasoning in Portuguese, which points to the need for specialized benchmarks.
Principles
- LLM performance varies systematically across models.
- Human-in-the-loop evaluation is crucial for NLI benchmarks.
Method
A diagnostic subset of seven FraCaS problems was translated into Brazilian Portuguese and submitted multiple times to LLMs (ChatGPT, Maritalk, Evaristo) to assess correctness, variance, and consistency against gold labels.
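The repeated-submission scoring described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the summary does not say how correctness, variance, or consistency were computed, so the metric definitions here are assumptions (accuracy as the share of runs matching the gold label, consistency as the share of runs agreeing with the most frequent answer, and the number of distinct answers as a rough indicator of variance). The names `RunSummary` and `summarize_runs`, the yes/no/unknown label set, and the sample answers are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class RunSummary:
    accuracy: float        # share of runs matching the gold label
    consistency: float     # share of runs agreeing with the most frequent answer
    modal_answer: str      # the most frequent answer across runs
    distinct_answers: int  # how many different labels the model produced


def summarize_runs(answers: list[str], gold: str) -> RunSummary:
    """Score repeated answers to one problem against its gold label."""
    if not answers:
        raise ValueError("need at least one answer")
    # Normalize case and whitespace so "Yes " and "yes" count as the same label.
    normalized = [a.strip().lower() for a in answers]
    counts = Counter(normalized)
    modal_answer, modal_count = counts.most_common(1)[0]
    total = len(normalized)
    return RunSummary(
        accuracy=counts[gold.strip().lower()] / total,
        consistency=modal_count / total,
        modal_answer=modal_answer,
        distinct_answers=len(counts),
    )


# Hypothetical answers from five submissions of a single problem.
runs = ["yes", "yes", "unknown", "Yes", "no"]
summary = summarize_runs(runs, gold="yes")
print(summary)
```

Keeping accuracy and consistency separate matters because a model can be stable and still wrong: five identical incorrect answers give a consistency of 1.0 and an accuracy of 0.0.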
In practice
- Use FraCaS-BR for Portuguese NLI evaluation.
- Incorporate human review for NLI dataset translation.
Topics
- Textual Inference
- Natural Language Inference
- Portuguese Language Models
- FraCaS Benchmark
- Semantic Reasoning
Best for: Research Scientist, AI Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.