AMALIA-VL: A Native European Portuguese Open-Source Vision and Language Model
Summary
AMALIA-VL is introduced as the first open-source instruction-tuned Large Vision and Language Model (LVLM) developed natively for European Portuguese (pt-PT). It targets a gap in existing multimodal models, which often conflate pt-PT with Brazilian Portuguese or include too little pt-PT training data. AMALIA-VL connects a high-resolution vision encoder with dynamic image tiling to a pt-PT-optimized language model through a learned connector. It is trained in three stages: vision-language alignment, general visual instruction tuning, and preference optimization. The training mix is pt-PT-centric, combining curated and translated public datasets with new resources to compensate for the scarcity of European Portuguese multimodal data. The evaluation indicates that AMALIA-VL establishes a strong baseline for open-source pt-PT LVLMs. Model weights, training data, construction pipelines, and machine-translated pt-PT evaluation benchmarks are slated for release.
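To make "dynamic image tiling" concrete, here is a minimal sketch of one common way to do it: pick a grid of fixed-size tiles whose aspect ratio best matches the image, then crop the resized image along that grid. The summary does not describe AMALIA-VL's actual tiling rule, so the function names, the 448-pixel tile size, the 6-tile budget, and the tie-breaking rule are all assumptions made for illustration.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def choose_tile_grid(width: int, height: int, max_tiles: int = 6) -> Tuple[int, int]:
    """Return (cols, rows) with cols * rows <= max_tiles whose aspect ratio
    is closest to the image's. Ties go to the grid with more tiles.
    Hypothetical helper; not taken from the AMALIA-VL paper."""
    target = width / height
    best = (1, 1)
    best_key = (abs(target - 1.0), -1)
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            # Smaller aspect-ratio error wins; on a tie, more tiles wins.
            key = (abs(target - cols / rows), -(cols * rows))
            if key < best_key:
                best, best_key = (cols, rows), key
    return best


def tile_boxes(width: int, height: int,
               tile_size: int = 448, max_tiles: int = 6) -> List[Box]:
    """Return crop boxes, in row-major order, for an image that has been
    resized to (cols * tile_size, rows * tile_size)."""
    cols, rows = choose_tile_grid(width, height, max_tiles)
    return [
        (c * tile_size, r * tile_size, (c + 1) * tile_size, (r + 1) * tile_size)
        for r in range(rows)
        for c in range(cols)
    ]


if __name__ == "__main__":
    # A 2:1 landscape image maps to a 2 x 1 grid of tiles.
    print(choose_tile_grid(1600, 800))
    print(tile_boxes(1600, 800))
```

Each box would then be cropped and passed to the vision encoder as a separate tile, so wide or tall images keep more detail than a single square resize would preserve.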
Key takeaway
For NLP Engineers and AI Scientists building multimodal applications for European Portuguese, AMALIA-VL offers a natively optimized open-source foundation. General-purpose models may underperform on pt-PT because their training data conflates it with Brazilian Portuguese or contains little of it; this work provides a strong baseline and dedicated resources. Consider evaluating AMALIA-VL where regional accuracy matters, and explore its weights and training pipelines once they are released to accelerate your own pt-PT LVLM development.
Key insights
AMALIA-VL is the first open-source LVLM natively optimized for European Portuguese, addressing a critical language resource gap.
Principles
- Language-specific models can outperform general-purpose ones on the target language variety.
- Curated, novel data is crucial for underserved languages.
- Multi-stage training improves LVLM performance.
Method
AMALIA-VL's training involves vision-language alignment, general visual instruction tuning, and preference optimization, using a pt-PT-centric multimodal data mix.
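The three stages can be written down as a small schedule. The sketch below is illustrative only: the stage names come from the summary, but the objective descriptions and the choice of which modules are trainable in each stage are assumptions based on common LVLM practice, not details confirmed for AMALIA-VL.

```python
from dataclasses import dataclass
from typing import Tuple

# The three components named in the summary.
MODULES = ("vision_encoder", "connector", "language_model")


@dataclass(frozen=True)
class Stage:
    name: str                   # stage name, as given in the summary
    objective: str              # ASSUMED description of the loss
    trainable: Tuple[str, ...]  # ASSUMED set of modules being updated


STAGES = (
    Stage("vision-language alignment",
          "next-token loss on image-text pairs",
          ("connector",)),
    Stage("general visual instruction tuning",
          "supervised loss on instruction-response pairs",
          ("connector", "language_model")),
    Stage("preference optimization",
          "preference loss over chosen vs. rejected responses",
          ("connector", "language_model")),
)


def frozen_modules(stage: Stage) -> Tuple[str, ...]:
    """Modules kept fixed during a stage: everything not marked trainable."""
    return tuple(m for m in MODULES if m not in stage.trainable)


if __name__ == "__main__":
    for number, stage in enumerate(STAGES, start=1):
        print(f"Stage {number}: {stage.name}")
        print(f"  objective: {stage.objective}")
        print(f"  trainable: {', '.join(stage.trainable)}")
        print(f"  frozen:    {', '.join(frozen_modules(stage)) or 'none'}")
```

Keeping the schedule as data makes it easy to swap in the actual freezing policy and objectives once the released training pipelines document them.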
In practice
- Develop language-specific LVLMs for regional accuracy.
- Create novel datasets for under-resourced languages.
- Utilize multi-stage instruction tuning for robust models.
Topics
- Large Vision and Language Models
- European Portuguese
- Multimodal AI
- Instruction Tuning
- Open-Source Models
- Dataset Curation
Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.