LARI Dataset: A Native Portuguese Question Answering Dataset from Brasileiras em PLN
Summary
The LARI dataset, introduced at PROPOR 2026, is a new resource for benchmarking and enhancing Question Answering (QA) in Portuguese, addressing a critical lack of native training data for the language. Developed by Júlia da Rocha Junqueira, Larissa A. de Freitas, and Ulisses Brisolara Corrêa, the dataset was created using a methodology that fine-tuned the Sabiá-7B model via QLoRA on a domain-specific corpus. Content for the dataset was extracted from the book "Natural Language Processing – Concepts, Techniques, and Applications in Portuguese (2nd Edition)". The generated context-question-answer triples underwent expert human evaluation, achieving an average quality score of 4.47 out of 5. Comprising 464 triples, the LARI dataset is publicly available, aiming to support future research in low-resource language settings.
Key takeaway
For research scientists developing Question Answering systems for low-resource languages, the LARI dataset offers a human-validated resource built specifically for Portuguese. Consider integrating this publicly available dataset into your training and evaluation pipelines to improve model performance and to benchmark against a high-quality, native-language standard. Doing so can reduce reliance on translated or less relevant data in Portuguese NLP.
Key insights
The LARI dataset provides a human-validated Portuguese QA resource of 464 context-question-answer triples, addressing a critical gap in native-language training data.
Principles
- Human validation improves dataset quality.
- Domain-specific fine-tuning enhances model performance.
Method
The authors fine-tuned the Sabiá-7B model with QLoRA on a domain-specific corpus, drew the dataset content from a Portuguese NLP textbook, and had experts validate the generated context-question-answer triples.
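The expert-validation step can be pictured as attaching ratings to each generated triple and averaging them. The sketch below is illustrative only: the `QATriple` class, its field names, the integer 1-to-5 rating scale, and the sample values are assumptions, not the authors' schema or data.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class QATriple:
    """One context-question-answer instance (hypothetical schema)."""
    context: str
    question: str
    answer: str
    # Expert scores, assumed here to be integers from 1 to 5.
    ratings: list[int] = field(default_factory=list)

    def score(self) -> float:
        """Mean expert rating for this triple."""
        if not self.ratings:
            raise ValueError("triple has no ratings")
        return mean(self.ratings)


def average_quality(triples: list[QATriple]) -> float:
    """Average of the per-triple mean ratings across a collection."""
    return mean(t.score() for t in triples)


# Placeholder values for illustration; not taken from LARI.
sample = [
    QATriple("contexto A", "pergunta A?", "resposta A", ratings=[5, 4]),
    QATriple("contexto B", "pergunta B?", "resposta B", ratings=[4, 4]),
]
print(round(average_quality(sample), 2))
```

Averaging per triple first, then across triples, keeps a triple with many raters from outweighing one with few; whether the reported 4.47 was computed this way is not stated in the summary.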
In practice
- Use Sabiá-7B with QLoRA for Portuguese NLP tasks.
- Leverage LARI for Portuguese QA model benchmarking.
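For benchmarking, a common starting point is to score model answers against reference answers with exact match and token-level F1. The summary does not say which metrics the authors recommend, so the SQuAD-style metrics below are an assumption, and the function names are hypothetical. The normalization is deliberately simple (lowercasing, ASCII punctuation removal, whitespace collapsing) and does not strip Portuguese articles.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return re.sub(r"\s+", " ", text).strip()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred = normalize(prediction).split()
    ref = normalize(reference).split()
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Sabiá-7B.", "sabiá-7b"))
print(token_f1("o modelo Sabiá", "modelo Sabiá"))
```

Averaging these two scores over all evaluated triples gives dataset-level numbers that can be compared across models.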
Topics
- LARI Dataset
- Portuguese Question Answering
- Sabiá-7B Model
- QLoRA
- Low-Resource NLP
Best for: Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.