Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models

2026-04-10 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

Almieyar-Oryx-BloomBench (BloomBench) is presented as the first cognitively grounded, bilingual (English–Arabic) multimodal benchmark for Vision-Language Models (VLMs). It evaluates six cognitive levels from Bloom's Taxonomy (Remember, Understand, Apply, Analyze, Evaluate, Create) using carefully designed image–question–answer tasks. It is built with a semi-automated pipeline and stratified hybrid quality assurance, which the authors position as supporting scalability and cultural inclusivity. A study of state-of-the-art VLMs, including Gemma 3 (4B, 12B, 27B), Gemma 4 (26B-A4B, 31B), Qwen2.5-VL-7B, Qwen2-VL-7B, and GPT-4o mini, finds sharp cognitive asymmetry: models perform strongly on semantic understanding and evaluation but struggle with factual recall and creative synthesis, so the strong results on some levels mask deeper limitations. The study also reports a critical performance gap between Arabic and English, exposing limits in cross-lingual multimodal reasoning.

Key takeaway

VLM developers and researchers aiming to build more human-like multimodal intelligence should prioritize the cognitive asymmetries identified here. Focus development on factual recall and creative synthesis, where current models are weak despite strong semantic understanding. Invest as well in robust cross-lingual generalization, particularly for languages like Arabic, by mitigating tokenization biases and strengthening the reasoning behind procedural application and creative tasks. The expected result is VLMs that are more cognitively aligned and more inclusive.

Key insights

BloomBench reveals cognitive asymmetry in VLMs: they excel at understanding and evaluation but fail at recall and creation, especially in cross-lingual settings.

Principles

Evaluate VLMs across all six Bloom's Taxonomy cognitive levels.
Design advanced VLM tasks to build on foundational cognitive skills.
Ground VLM evaluation in real-world, context-rich visual scenarios.
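
As a rough illustration of the first principle, the sketch below reports accuracy per Bloom level and per language rather than a single aggregate score, and computes the English–Arabic gap at a given level. The record keys (level, language, correct), the language codes, and the toy data are assumptions made for this example; they are not the benchmark's actual schema or results.

```python
from collections import defaultdict

# The six levels named in the summary above.
BLOOM_LEVELS = ["Remember", "Understand", "Apply", "Analyze", "Evaluate", "Create"]


def accuracy_by_level(records):
    """Return {(level, language): accuracy} for a list of graded answers.

    Each record is a dict with hypothetical keys 'level', 'language', 'correct'.
    """
    totals = defaultdict(lambda: [0, 0])  # (level, language) -> [num_correct, num_total]
    for r in records:
        if r["level"] not in BLOOM_LEVELS:
            raise ValueError(f"unknown Bloom level: {r['level']}")
        cell = totals[(r["level"], r["language"])]
        cell[0] += int(r["correct"])
        cell[1] += 1
    return {key: correct / total for key, (correct, total) in totals.items()}


def language_gap(scores, level, lang_a="en", lang_b="ar"):
    """Accuracy difference lang_a - lang_b at one cognitive level."""
    return scores[(level, lang_a)] - scores[(level, lang_b)]


# Toy graded answers, invented purely to exercise the functions.
toy_records = [
    {"level": "Remember", "language": "en", "correct": True},
    {"level": "Remember", "language": "en", "correct": False},
    {"level": "Remember", "language": "ar", "correct": False},
    {"level": "Remember", "language": "ar", "correct": False},
    {"level": "Understand", "language": "en", "correct": True},
    {"level": "Understand", "language": "ar", "correct": True},
]
scores = accuracy_by_level(toy_records)
```

Keeping the level and language as separate keys makes an asymmetry visible that a pooled accuracy would average away.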

Method

BloomBench uses a semi-automated pipeline for VQA generation, including scenario ideation, image sourcing, open-ended VQA generation, multiple-choice conversion, and bilingual translation, validated by LLM-as-a-judge and human review.
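
The human-review step can be pictured as drawing a fixed number of items from each stratum of the generated data. The sketch below is one possible reading only: the source does not state which strata are used, so stratifying by cognitive level and language is an assumption, as are the function name, the item keys, and the sample size.

```python
import random
from collections import defaultdict


def stratified_review_sample(items, per_stratum, seed=0):
    """Pick up to `per_stratum` items from each (level, language) group.

    `items` are dicts with hypothetical keys 'id', 'level', 'language'.
    """
    rng = random.Random(seed)  # seeded so the review sample is reproducible
    strata = defaultdict(list)
    for item in items:
        strata[(item["level"], item["language"])].append(item)

    sample = []
    for key in sorted(strata):  # fixed iteration order for determinism
        group = strata[key]
        k = min(per_stratum, len(group))  # small strata are taken whole
        sample.extend(rng.sample(group, k))
    return sample


# Toy pool: 2 levels x 2 languages x 5 items each (invented data).
toy_items = []
for level in ["Remember", "Create"]:
    for language in ["en", "ar"]:
        for _ in range(5):
            toy_items.append({"id": len(toy_items), "level": level, "language": language})

review_batch = stratified_review_sample(toy_items, per_stratum=2)
```

Sampling per stratum rather than uniformly keeps rare combinations, such as Arabic items at the Create level, from being underrepresented in human review.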

In practice

Use Likelihood-based Scoring (LBS) to diagnose VLM internal confidence.
Prioritize VLM development for factual recall and creative synthesis.
Address cross-lingual performance gaps, especially for Arabic.
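
The summary does not spell out how Likelihood-based Scoring is computed. A common formulation for multiple-choice tasks, assumed here, scores each answer option by the log-likelihood the model assigns to its tokens, picks the highest-scoring option, and normalizes the scores into a confidence distribution. The sketch takes per-token log-probabilities as plain input so that it stays independent of any model API; the function name, the length normalization, and the toy numbers are all assumptions.

```python
import math


def lbs_choice(option_token_logprobs, length_normalize=True):
    """Pick the option with the highest (optionally length-normalized) log-likelihood.

    `option_token_logprobs` maps an option label to the list of token
    log-probabilities the model assigned to that option's text.
    Returns (best_label, {label: confidence}) with confidences summing to 1.
    """
    scores = {}
    for label, logprobs in option_token_logprobs.items():
        if not logprobs:
            raise ValueError(f"option {label!r} has no tokens")
        total = sum(logprobs)
        # Length normalization (an assumption) avoids penalizing longer options.
        scores[label] = total / len(logprobs) if length_normalize else total

    # Softmax over option scores, shifted by the max for numerical stability.
    top = max(scores.values())
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    z = sum(exps.values())
    confidence = {label: e / z for label, e in exps.items()}

    best = max(confidence, key=confidence.get)
    return best, confidence


# Toy log-probabilities, invented for illustration.
toy_logprobs = {
    "A": [-0.2, -0.3],
    "B": [-1.5, -2.0],
    "C": [-2.5, -1.0, -0.5],
    "D": [-3.0],
}
choice, confidence = lbs_choice(toy_logprobs)
```

Comparing the confidence on the chosen option against whether it was correct is one way to tell a model that is uncertain from one that is confidently wrong.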

Topics

Vision-Language Models
Multimodal Benchmarking
Bloom's Taxonomy
Cognitive Evaluation
Bilingual AI
Arabic NLP

Code references

qcri/Almieyar-Oryx-BloomBench

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.