AMALIA-VL: A Native European Portuguese Open-Source Vision and Language Model

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Computer Vision, Natural Language Processing) · Depth: Expert, quick

Summary

AMALIA-VL is presented as the first open-source, instruction-tuned Large Vision and Language Model (LVLM) developed natively for European Portuguese (pt-PT). It targets a gap in existing multimodal models, which often conflate pt-PT with Brazilian Portuguese or include too little pt-PT data in training. The architecture couples a high-resolution vision encoder that uses dynamic image tiling with a pt-PT-optimized language model, joined by a learned connector. Training proceeds in three stages: vision-language alignment, general visual instruction tuning, and preference optimization. The data mix is pt-PT-centric, combining curated and translated public datasets with newly created resources to offset the scarcity of European Portuguese multimodal data. According to the reported evaluation, AMALIA-VL sets a strong baseline for open-source pt-PT LVLMs. The model weights, training data, construction pipelines, and machine-translated pt-PT evaluation benchmarks are slated for release.
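The summary does not describe how the dynamic image tiling works, so the following is only a minimal sketch of one common approach: pick the tile grid whose aspect ratio is closest to the image's, then cut the resized image into fixed-size tiles. The function names, the 448-pixel tile size, and the six-tile budget are all assumptions made for illustration and are not taken from AMALIA-VL.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def choose_tile_grid(width: int, height: int, max_tiles: int = 6) -> Tuple[int, int]:
    """Return (cols, rows) with cols * rows <= max_tiles that best matches the image aspect ratio.

    Hypothetical helper: the tile budget is an assumed value.
    """
    target = width / height
    candidates = [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if cols * rows <= max_tiles
    ]
    # Closest aspect ratio wins; ties go to the grid with more tiles.
    # This is a simplification: a fuller version would also weigh the image area.
    return min(candidates, key=lambda g: (abs(g[0] / g[1] - target), -g[0] * g[1]))


def tile_boxes(cols: int, rows: int, tile_size: int = 448) -> List[Box]:
    """Crop boxes for an image already resized to (cols * tile_size, rows * tile_size)."""
    return [
        (c * tile_size, r * tile_size, (c + 1) * tile_size, (r + 1) * tile_size)
        for r in range(rows)
        for c in range(cols)
    ]


if __name__ == "__main__":
    cols, rows = choose_tile_grid(1600, 800)
    print(cols, rows, tile_boxes(cols, rows))
```

In a setup like this, each tile would be encoded separately by the vision encoder, which is how a fixed-resolution encoder can handle large or elongated images without squashing them into a single square.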

Key takeaway

For NLP engineers and AI scientists building multimodal applications for European Portuguese, AMALIA-VL offers a natively optimized open-source foundation. Models that conflate pt-PT with Brazilian Portuguese, or that saw little pt-PT data during training, may underperform on regional usage; this release provides a baseline and dedicated resources to compare against. Consider evaluating AMALIA-VL where pt-PT accuracy matters, and explore the weights and training pipelines once they are released to accelerate your own pt-PT LVLM work.

Key insights

AMALIA-VL is the first open-source LVLM natively optimized for European Portuguese, addressing a critical language resource gap.

Principles

Method

AMALIA-VL is trained in three stages (vision-language alignment, general visual instruction tuning, and preference optimization) on a pt-PT-centric multimodal data mix.
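The summary does not say which preference-optimization objective is used in the third stage. As an illustration only, here is a sketch of the Direct Preference Optimization (DPO) loss for a single preference pair, which is one widely used formulation. The function name, the argument names, and the `beta` value are assumptions and are not taken from AMALIA-VL.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls how strongly the policy may drift from the reference
) -> float:
    """DPO loss for one (chosen, rejected) response pair.

    Inputs are summed token log-probabilities of each response under the
    trainable policy and under a frozen reference model.
    """
    # How much more the policy prefers the chosen response than the reference does.
    margin = (policy_chosen_logp - ref_chosen_logp) - (policy_rejected_logp - ref_rejected_logp)
    x = beta * margin
    # loss = -log(sigmoid(x)) = log(1 + exp(-x)), computed in a numerically stable way.
    if x > 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


if __name__ == "__main__":
    # The policy favours the chosen answer more than the reference does, so the loss is below log(2).
    print(dpo_loss(-10.0, -20.0, -12.0, -15.0))
```

When the policy and the reference agree exactly, the margin is zero and the loss equals log 2. The loss falls as the policy shifts probability toward the preferred response relative to the reference.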

In practice

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.