Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

The paper introduces Almieyar-Oryx-BloomBench (BloomBench), which it presents as the first cognitively grounded, bilingual (English–Arabic) multimodal benchmark for Vision-Language Models (VLMs). The benchmark systematically evaluates the six cognitive levels of Bloom's Taxonomy (Remember, Understand, Apply, Analyze, Evaluate, Create) using carefully designed image–question–answer tasks. It is built with a semi-automated pipeline and stratified hybrid quality assurance, a design aimed at scalability and cultural inclusivity. A comprehensive study using BloomBench finds that state-of-the-art VLMs, including Gemma 3 (4B, 12B, 27B), Gemma 4 (26B-A4B, 31B), Qwen2.5-VL-7B, Qwen2-VL-7B, and GPT-4o mini, exhibit a sharp cognitive asymmetry. The models perform strongly on semantic understanding and evaluation, which masks deeper limitations in factual recall and creative synthesis. The study also highlights a critical performance gap between Arabic and English, exposing limits in cross-lingual multimodal reasoning.

Key takeaway

VLM developers and researchers aiming to build more human-like multimodal intelligence should prioritize the cognitive asymmetries identified here. Focus development effort on factual recall and creative synthesis, where current models are markedly weaker than they are on semantic understanding. Also invest in robust cross-lingual generalization, particularly for languages such as Arabic, by mitigating tokenization biases and strengthening the underlying reasoning for procedural application and creative tasks. The goal is VLMs that are more cognitively aligned and more inclusive.

Key insights

BloomBench reveals a cognitive asymmetry in VLMs: they excel at understanding and evaluation but fall short on recall and creation, especially across languages.
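
One way to make this asymmetry concrete is to break accuracy down by Bloom level and by language, then compare the two languages level by level. The sketch below is a minimal illustration, not the paper's evaluation code: the record fields (`lang`, `level`, `correct`), the language codes, and the toy records are all assumptions invented for the example and do not reproduce any reported result.

```python
from collections import defaultdict

# The six cognitive levels named in the summary above.
BLOOM_LEVELS = ["Remember", "Understand", "Apply", "Analyze", "Evaluate", "Create"]


def accuracy_by_level(records):
    """Return {(lang, level): accuracy} for an iterable of graded answers.

    Each record is a dict with hypothetical keys 'lang', 'level', 'correct'.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        key = (r["lang"], r["level"])
        totals[key] += 1
        hits[key] += int(r["correct"])
    return {key: hits[key] / totals[key] for key in totals}


def language_gap(acc, level, lang_a="en", lang_b="ar"):
    """Accuracy of lang_a minus accuracy of lang_b at one Bloom level."""
    return acc[(lang_a, level)] - acc[(lang_b, level)]


# Toy, made-up records purely to exercise the functions.
records = [
    {"lang": "en", "level": "Remember", "correct": True},
    {"lang": "en", "level": "Remember", "correct": False},
    {"lang": "ar", "level": "Remember", "correct": False},
    {"lang": "ar", "level": "Remember", "correct": False},
    {"lang": "en", "level": "Understand", "correct": True},
    {"lang": "ar", "level": "Understand", "correct": True},
]

acc = accuracy_by_level(records)
for level in BLOOM_LEVELS:
    if ("en", level) in acc and ("ar", level) in acc:
        print(level, acc[("en", level)], acc[("ar", level)], language_gap(acc, level))
```

Reporting per-level and per-language scores side by side, rather than a single aggregate, is what lets a weakness at one level stay visible instead of being averaged away.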

Method

BloomBench generates its VQA items with a semi-automated pipeline whose stages are scenario ideation, image sourcing, open-ended VQA generation, multiple-choice conversion, and bilingual translation. The resulting items are validated by an LLM-as-a-judge and by human review.
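
To illustrate one stage of such a pipeline, the following sketch shows what a multiple-choice conversion step might look like: an open-ended item is combined with distractors and shuffled reproducibly. Everything here is hypothetical, including the `VQAItem` fields, the `to_multiple_choice` helper, and the sample item; the summary does not specify how the authors produce distractors or store items.

```python
import random
from dataclasses import dataclass, field


@dataclass
class VQAItem:
    """Hypothetical container for one image-question-answer item."""
    image_id: str
    level: str          # Bloom level, e.g. "Apply"
    question: str
    answer: str
    options: list = field(default_factory=list)
    answer_index: int = -1  # -1 means the item is still open-ended


def to_multiple_choice(item, distractors, seed=0):
    """Return a copy of `item` with shuffled options and the correct index.

    Assumes the distractors come from an earlier (unspecified) step.
    """
    if item.answer in distractors:
        raise ValueError("a distractor duplicates the correct answer")
    options = [item.answer] + list(distractors)
    # A seeded generator keeps the option order reproducible across runs.
    random.Random(seed).shuffle(options)
    return VQAItem(
        image_id=item.image_id,
        level=item.level,
        question=item.question,
        answer=item.answer,
        options=options,
        answer_index=options.index(item.answer),
    )


# Made-up example item, not taken from the benchmark.
open_item = VQAItem("img_001", "Remember", "What object is flying?", "a red kite")
mc_item = to_multiple_choice(open_item, ["a drone", "a balloon", "a bird"], seed=42)
print(mc_item.options, mc_item.answer_index)
```

Rejecting distractors that duplicate the answer keeps `answer_index` unambiguous, which matters if an automated judge later grades by option index.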

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.