How Good LLMs Are at Answering Bangla Medical Visual Questions? Dataset and Benchmarking

Source: Computation and Language · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Medical Devices & Health Technology) · Depth: Expert, quick

Summary

A new dataset, BanglaMedVQA, has been introduced to benchmark Large Language Models (LLMs) and Large Vision Language Models (LVLMs) on medical visual question answering (MedVQA) in Bangla. The dataset consists of clinically validated image-question-answer pairs and addresses a significant gap for one of the world's most widely spoken languages. Initial evaluations of current foundation models, including Gemini and GPT-4.1 mini, show substantially lower performance on BanglaMedVQA than on English MedVQA benchmarks. The models struggle with specialized diagnostic questions and fine-grained medical reasoning, which points to severe limitations in low-resource language contexts. Some open-source models, such as Gemma-3, occasionally show better general performance, but they also fail on clinically complex questions, highlighting the need for improved evaluation methods and model capabilities.

Key takeaway

For AI Scientists and Machine Learning Engineers developing medical AI, the BanglaMedVQA dataset highlights critical performance gaps in low-resource languages. Your current foundation models, even top-tier ones like Gemini and GPT-4.1 mini, are likely insufficient for accurate clinical reasoning in Bangla. Prioritize research into language-specific fine-tuning and advanced reasoning architectures to address these limitations and ensure equitable access to medical AI.

Key insights

The BanglaMedVQA dataset reveals that current LLMs and LVLMs perform poorly on medical visual questions in Bangla.

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.