LARI Dataset: A Native Portuguese Question Answering Dataset from Brasileiras em PLN

2026-04-12 · Source: Paper Index on ACL Anthology · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, medium

Summary

The LARI dataset, introduced at PROPOR 2026, is a new resource for benchmarking and improving Question Answering (QA) in Portuguese, a language with little native training data. It was developed by Júlia da Rocha Junqueira, Larissa A. de Freitas, and Ulisses Brisolara Corrêa, using a methodology that fine-tuned the Sabiá-7B model via QLoRA on a domain-specific corpus. Content for the dataset was extracted from the book "Natural Language Processing – Concepts, Techniques, and Applications in Portuguese (2nd Edition)". The generated context-question-answer triples underwent expert human evaluation and achieved an average quality score of 4.47 out of 5. The dataset comprises 464 triples and is publicly available to support future research in low-resource language settings.

Key takeaway

If you develop Question Answering systems for low-resource languages, the LARI dataset offers a human-validated resource written natively in Portuguese. Consider integrating this publicly available dataset into your training and evaluation pipelines, so that you can benchmark against native-language data and rely less on translated or out-of-domain data. At 464 triples it is small, so it is likely to be most useful for evaluation or as a supplement to other training data.

Key insights

The LARI dataset provides a human-validated Portuguese QA resource, addressing a critical language resource gap.

Principles

Expert human validation gives generated data a measurable quality signal (here, an average of 4.47 out of 5).
Domain-specific fine-tuning adapts a general-purpose model to the target domain.

Method

The methodology has three parts: fine-tuning the Sabiá-7B model with QLoRA on a domain-specific corpus, extracting content from a Portuguese NLP textbook, and having experts validate the generated QA instances.
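The expert validation step can be illustrated with a small sketch. The Python below is not the authors' tooling: the `QATriple` class, its field names, the per-annotator 1-to-5 ratings, and the toy data are all assumptions made for illustration. It shows only how per-triple ratings might be averaged into a single dataset-level score of the kind reported for LARI (4.47 out of 5).

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List


@dataclass
class QATriple:
    """One generated context-question-answer instance (hypothetical schema)."""
    context: str
    question: str
    answer: str
    # Assumed format: one integer score from 1 to 5 per expert annotator.
    ratings: List[int] = field(default_factory=list)

    def score(self) -> float:
        """Mean rating for this triple; raises StatisticsError if unrated."""
        return mean(self.ratings)


def average_quality(triples: List[QATriple]) -> float:
    """Dataset-level score: the mean of the per-triple mean ratings."""
    return mean(t.score() for t in triples)


if __name__ == "__main__":
    # Toy data for illustration only; these are not LARI entries.
    toy = [
        QATriple("contexto A", "pergunta A", "resposta A", ratings=[5, 4]),
        QATriple("contexto B", "pergunta B", "resposta B", ratings=[3, 3]),
    ]
    print(f"average quality: {average_quality(toy):.2f} / 5")
```

Averaging per triple first gives every triple equal weight even if some were rated by more annotators than others. Whether the paper aggregates its scores this way is not stated in this summary.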

In practice

Consider Sabiá-7B fine-tuned with QLoRA for domain-specific Portuguese NLP tasks.
Use LARI to benchmark Portuguese QA models.
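As a rough sketch of what benchmarking against LARI could look like, the Python below scores a QA system with exact match and token-level F1. These are common extractive-QA metrics, not ones this summary attributes to the paper. The record keys (`context`, `question`, `answer`) and the `answer_fn` callable are hypothetical, since the dataset's actual file format is not described here. Normalization is deliberately minimal (lowercasing, ASCII punctuation removal, whitespace collapsing) and skips the English article stripping used in some QA scorers, which would not suit Portuguese.

```python
import string
from collections import Counter
from typing import Callable, Dict, List


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred = normalize(prediction).split()
    ref = normalize(reference).split()
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def evaluate(
    examples: List[Dict[str, str]],
    answer_fn: Callable[[str, str], str],
) -> Dict[str, float]:
    """Average both metrics over records with assumed keys
    'context', 'question', and 'answer'."""
    if not examples:
        raise ValueError("no examples to evaluate")
    em_total = 0.0
    f1_total = 0.0
    for ex in examples:
        prediction = answer_fn(ex["context"], ex["question"])
        em_total += exact_match(prediction, ex["answer"])
        f1_total += token_f1(prediction, ex["answer"])
    n = len(examples)
    return {"exact_match": em_total / n, "f1": f1_total / n}
```

Here `answer_fn` would wrap whatever model is being tested, taking a context and a question and returning the predicted answer string. If LARI answers turn out to be long, free-form sentences, overlap metrics like these may understate quality, and a semantic or human evaluation might be a better fit.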

Topics

LARI Dataset
Portuguese Question Answering
Sabiá-7B Model
QLoRA
Low-Resource NLP

Best for: Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.