AMALIA: A Fully Open Large Language Model for European Portuguese

2026-04-12 · Source: Paper Index on ACL Anthology · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, medium

Summary

AMALIA is a fully open large language model (LLM) designed specifically for European Portuguese (pt-PT), a variant that is underrepresented in existing LLMs and evaluation benchmarks. Developed by Afonso Simplício et al. and presented at PROPOR 2026, AMALIA prioritizes pt-PT by incorporating more high-quality pt-PT data during its mid- and post-training phases. To support accurate evaluation, the researchers also released a suite of pt-PT benchmarks comprising translated standard tasks and four novel datasets. The new datasets target pt-PT generation, linguistic competence, and bias between pt-PT and Brazilian Portuguese (pt-BR). Experimental results indicate that AMALIA performs comparably to strong baselines on translated benchmarks while improving significantly on evaluations tailored to pt-PT.

Key takeaway

Research scientists developing LLMs for specific language variants should prioritize creating and using high-quality, variant-specific training data. They should also invest in native evaluation benchmarks, because machine-translated benchmarks may fail to capture crucial linguistic and cultural nuances and can therefore give an inaccurate picture of performance on the target variant.

Key insights

Targeted training and native benchmarking are crucial for underrepresented language variants like European Portuguese.

Principles

Prioritize high-quality data for target language variants.
Develop native benchmarks for accurate evaluation.

Method

AMALIA was developed by integrating more high-quality European Portuguese data during mid- and post-training stages, complemented by a new suite of pt-PT-specific evaluation benchmarks.
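
One simple way to picture "integrating more pt-PT data" is as upweighting one source in a training mixture. The sketch below is only an illustration under stated assumptions: the source names, token counts, and the upweighting factor are hypothetical, and the summary does not say how AMALIA's authors actually built their mixture.

```python
import random


def upweight(token_counts, target, factor):
    """Return per-source sampling probabilities with `target` upweighted.

    Each source starts with a weight proportional to its token count;
    the target source's weight is multiplied by `factor` before normalizing.
    """
    weighted = {
        name: count * (factor if name == target else 1.0)
        for name, count in token_counts.items()
    }
    total = sum(weighted.values())
    return {name: weight / total for name, weight in weighted.items()}


# Hypothetical token counts (in billions) for three data sources.
token_counts = {"pt_pt": 10, "pt_br": 60, "en": 30}

# Assumption: pt-PT is sampled three times more often than its raw share.
probs = upweight(token_counts, target="pt_pt", factor=3.0)

# Draw a small batch of source labels according to the mixture.
rng = random.Random(0)
batch_sources = rng.choices(list(probs), weights=list(probs.values()), k=8)
```

With these made-up numbers, pt-PT rises from 10% of the raw data to 25% of the sampled mixture, which shows how a small corpus can be given more influence without discarding the other sources.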

In practice

Use pt-PT-specific datasets for fine-tuning.
Use the released pt-PT benchmarks for evaluation.
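
A variant-aware evaluation can be sketched as a preference test over paired sentences: for each pair, check whether a scorer rates the pt-PT wording above its pt-BR counterpart. This is a minimal sketch, not the format of the released benchmarks; `variant_preference_rate`, `toy_score`, and the example pairs are hypothetical, and in practice the scorer would be a model's log-likelihood rather than a keyword rule.

```python
from typing import Callable, Iterable, Tuple


def variant_preference_rate(
    pairs: Iterable[Tuple[str, str]],
    score: Callable[[str], float],
) -> float:
    """Fraction of (pt-PT, pt-BR) pairs where the pt-PT text scores strictly higher."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one sentence pair is required")
    wins = sum(1 for pt_pt, pt_br in pairs if score(pt_pt) > score(pt_br))
    return wins / len(pairs)


# Hypothetical stand-in for a model scorer: +1 per pt-PT marker, -1 per pt-BR marker.
PT_PT_MARKERS = ("a ler", "autocarro", "pequeno-almoço")
PT_BR_MARKERS = ("lendo", "ônibus", "café da manhã")


def toy_score(text: str) -> float:
    lowered = text.lower()
    positive = sum(marker in lowered for marker in PT_PT_MARKERS)
    negative = sum(marker in lowered for marker in PT_BR_MARKERS)
    return float(positive - negative)


# Illustrative pairs: (pt-PT wording, pt-BR wording) of the same sentence.
pairs = [
    ("Estou a ler o livro.", "Estou lendo o livro."),
    ("Apanhei o autocarro.", "Peguei o ônibus."),
    ("Vou tomar o pequeno-almoço.", "Vou tomar o café da manhã."),
]

rate = variant_preference_rate(pairs, toy_score)
```

A rate near 1.0 suggests the scorer favors pt-PT forms, while a rate near 0.0 suggests a pt-BR bias. The strict comparison means ties count against pt-PT, which is a design choice that could reasonably be made differently.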

Topics

AMALIA LLM
European Portuguese
Language Model Training
Native Benchmarking
Linguistic Nuances

Best for: Research Scientist, AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.