Textual Inference in Portuguese: Comparing Language Models

2026-04-12 · Source: Paper Index on ACL Anthology · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A preliminary study introduces FraCaS-BR, a Portuguese adaptation of the FraCaS semantic inference benchmark, to evaluate how well large language models (LLMs) handle logic-sensitive semantic reasoning outside English. The researchers tested ChatGPT, Maritalk, and Evaristo on a small diagnostic subset of seven FraCaS problems translated into Brazilian Portuguese, covering generalized quantifiers, plurals, and nominal anaphora. Each problem was submitted multiple times so that correctness, variance, and consistency could be measured against the original FraCaS gold labels. The study, presented at PROPOR 2026, found systematic differences among the models, with ChatGPT showing higher overall correctness and stability. All three models nonetheless showed limitations on logic-controlled inference tasks, and human intervention was needed during both translation and evaluation. These findings motivate the continued development of FraCaS-BR as a controlled resource for assessing semantic reasoning in Portuguese.

Key takeaway

Research scientists developing or evaluating LLMs for non-English natural language inference should prioritize building and using logic-controlled, language-specific benchmarks such as FraCaS-BR. Evaluations should account for systematic differences between models and include human-in-the-loop validation, both to ensure reliability and to address limitations in semantic reasoning, especially on complex linguistic phenomena.

Key insights

The tested LLMs struggled with logic-sensitive semantic reasoning in Portuguese on this seven-problem subset, which points to the need for specialized benchmarks.

Principles

LLM performance varies systematically across models.
Human-in-the-loop evaluation is crucial for NLI benchmarks.

Method

A diagnostic subset of seven FraCaS problems was translated into Brazilian Portuguese and submitted multiple times to LLMs (ChatGPT, Maritalk, Evaristo) to assess correctness, variance, and consistency against gold labels.
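
The repeated-submission protocol described above can be summarized per problem with a few simple statistics. The following is a minimal sketch, not the authors' scoring code: the function name `score_problem`, the label strings ("yes", "no", "unknown"), the sample answers, and the specific definitions (correctness as the share of runs matching the gold label, variance as the population variance of the 0/1 correctness indicator, consistency as the share of runs agreeing with the most frequent answer) are all assumptions made for illustration.

```python
from collections import Counter
from statistics import pvariance


def score_problem(responses, gold):
    """Summarize repeated model answers for a single inference problem.

    responses: list of label strings returned across repeated runs
    gold: the gold label for the problem
    Metric definitions here are illustrative assumptions, not the paper's.
    """
    if not responses:
        raise ValueError("responses must not be empty")
    # 1 where the run matched the gold label, 0 otherwise
    hits = [1 if r == gold else 0 for r in responses]
    # Most frequent answer and how many runs produced it
    modal_label, modal_count = Counter(responses).most_common(1)[0]
    return {
        "correctness": sum(hits) / len(hits),
        "variance": pvariance(hits),
        "consistency": modal_count / len(responses),
        "modal_label": modal_label,
    }


# Hypothetical data: problem id -> (answers over four runs, gold label)
runs = {
    "p1": (["yes", "yes", "yes", "yes"], "yes"),
    "p2": (["yes", "unknown", "yes", "no"], "unknown"),
}
report = {pid: score_problem(resp, gold) for pid, (resp, gold) in runs.items()}

for pid, stats in report.items():
    print(pid, stats)
```

Keeping consistency separate from correctness is a deliberate choice in this sketch: a model can be stable yet wrong (always answering "yes" when the gold label is "unknown"), and the two numbers make that case visible.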

In practice

Use FraCaS-BR as a controlled diagnostic for Portuguese NLI evaluation, bearing in mind that it is still under development.
Incorporate human review for NLI dataset translation.

Topics

Textual Inference
Natural Language Inference
Portuguese Language Models
FraCaS Benchmark
Semantic Reasoning

Best for: Research Scientist, AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.