Linguistic Symptoms: Augmented Generation by Retrieval and Reasoning in LLMs under Portuguese-English Variation in Medical Contexts

Source: Paper Index on ACL Anthology · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Natural Language Processing, Medical Devices & Health Technology) · Depth: Expert, quick

Summary

This study examines how linguistic variation affects Large Language Models (LLMs) on medical reasoning tasks by comparing Portuguese and English inputs. The researchers evaluated two MedGemma variants, with 4B and 27B parameters, on three medical datasets, combining quantitative accuracy metrics with qualitative and structural analyses of the models' reasoning chains. The 4B-parameter model was significantly affected by the input language and performed consistently worse on Portuguese inputs. The 27B-parameter variant was more robust across languages, maintaining similar accuracy and reasoning structures in both. The Retrieval-Augmented Generation (RAG) system achieved good document retrieval quality, yet it did not consistently improve the smaller model's performance, which suggests that the smaller model is limited in its ability to use additional context.

Key takeaway

If you are developing medical LLMs for multilingual environments, model size is a critical choice. In this study, the smaller model (4B parameters) degraded significantly on Portuguese inputs, while the larger model (27B parameters) remained robust. Prioritize larger models for cross-language consistency, and test RAG integration thoroughly, as it may not compensate for a smaller model's linguistic limitations.

Key insights

Linguistic variation significantly impacts smaller LLMs in medical contexts, with larger models showing greater cross-language robustness.

Method

Experiments used MedGemma 4B and 27B variants on medical datasets, evaluating quantitative accuracy and qualitative reasoning chains under Portuguese and English input variations, with and without RAG.
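The evaluation grid described above (model × input language × RAG on/off) can be sketched as a small accuracy tally. This is a minimal illustration, not the authors' evaluation code: the `Record` type, the function names, the `"model-a"` label, and the toy answers are all hypothetical placeholders, and none of the values are results from the study.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class Record(NamedTuple):
    """One evaluated question (hypothetical schema, not from the paper)."""
    model: str      # free-form model label
    language: str   # "en" or "pt"
    rag: bool       # True if retrieved context was added to the prompt
    predicted: str  # answer chosen by the model
    gold: str       # reference answer


Condition = Tuple[str, str, bool]


def accuracy_by_condition(records: Iterable[Record]) -> Dict[Condition, float]:
    """Return accuracy for each (model, language, rag) condition."""
    totals: Dict[Condition, int] = defaultdict(int)
    correct: Dict[Condition, int] = defaultdict(int)
    for r in records:
        key = (r.model, r.language, r.rag)
        totals[key] += 1
        correct[key] += int(r.predicted == r.gold)
    return {key: correct[key] / totals[key] for key in totals}


def language_gap(acc: Dict[Condition, float], model: str, rag: bool) -> float:
    """English accuracy minus Portuguese accuracy for one model and RAG setting."""
    return acc[(model, "en", rag)] - acc[(model, "pt", rag)]


# Toy placeholder records, for illustration only (not study data).
toy_records = [
    Record("model-a", "en", False, "B", "B"),
    Record("model-a", "en", False, "C", "C"),
    Record("model-a", "pt", False, "B", "B"),
    Record("model-a", "pt", False, "A", "C"),
]

toy_accuracy = accuracy_by_condition(toy_records)
print(language_gap(toy_accuracy, "model-a", rag=False))
```

A positive gap means the model scored higher on English than on Portuguese under the same RAG setting. Comparing the gap with and without RAG for each model size is one simple way to check whether retrieval narrows the cross-language difference.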

Best for: AI Scientist, NLP Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.