Indi-RomCoM: Code-Mixed Benchmark for Evaluating LLMs on Romanized Indic-English Instructions
Summary
The Indi-RomCoM benchmark addresses a gap in evaluating Large Language Models (LLMs) on Romanized Code Mixing (RCM), a prevalent communication style that blends local languages with English in Roman script. The benchmark supports systematic evaluation across seven instruction-following tasks, four widely spoken Indic languages, and three controlled code-mixing intensity levels. An extensive evaluation of proprietary, open-weight, and Indic-focused LLMs under zero- and few-shot settings shows consistent underperformance on RCM instructions, with performance degrading notably as code-mixing density increases. Reasoning tasks degrade less than detection tasks such as Toxicity, primarily because the generated explanations provide necessary context.
Key takeaway
NLP engineers developing or deploying multilingual LLMs should prioritize robust handling of Romanized Code Mixing (RCM). Current LLMs consistently underperform on RCM instructions, and performance worsens as code-mixing density increases. Focus development effort on improving RCM comprehension, particularly for detection tasks, and consider how generated explanations might mitigate degradation in reasoning tasks.
Key insights
LLMs consistently underperform on Romanized Code-Mixed instructions, with performance degrading as code-mixing density increases.
Principles
- Romanized Code Mixing is a dominant multilingual communication form.
- LLM performance on RCM degrades with increased code-mixing density.
- Reasoning tasks are more robust than detection tasks in RCM contexts.
Method
The Indi-RomCoM benchmark evaluates LLMs on seven instruction-following tasks across four Indic languages and three code-mixing intensity levels.
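The evaluation grid described above can be sketched as a simple enumeration. This is a minimal illustration, not the benchmark's own tooling: the summary gives only the counts (seven tasks, four languages, three intensity levels, plus zero- and few-shot settings), so the task names, language names, and the intensity labels `low`/`medium`/`high` below are hypothetical placeholders.

```python
from itertools import product

# Hypothetical placeholder labels: the summary states only how many tasks,
# languages, and intensity levels exist, not what they are called.
TASKS = [f"task_{i}" for i in range(1, 8)]        # seven tasks
LANGUAGES = [f"lang_{i}" for i in range(1, 5)]    # four Indic languages
INTENSITIES = ["low", "medium", "high"]           # three levels (assumed names)
SETTINGS = ["zero-shot", "few-shot"]              # prompting settings


def evaluation_grid():
    """Return one dict per (task, language, intensity, setting) combination."""
    return [
        {"task": t, "language": l, "intensity": i, "setting": s}
        for t, l, i, s in product(TASKS, LANGUAGES, INTENSITIES, SETTINGS)
    ]


if __name__ == "__main__":
    grid = evaluation_grid()
    print(len(grid), "evaluation cells per model")
```

Under these assumptions each model is scored on 7 × 4 × 3 = 84 task-language-intensity cells, or 168 once both prompting settings are included.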
In practice
- Systematically evaluate LLMs for Romanized Code-Mixed instruction following.
- Develop more inclusive multilingual LLM systems.
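One way to check whether a model's scores fall as code-mixing intensity rises is to average results per intensity level and compare them against a reference level. The sketch below assumes results are available as `(intensity_label, score)` pairs; that record shape and the helper names `mean_score_by_intensity` and `relative_drop` are illustrative assumptions, not part of the benchmark.

```python
from collections import defaultdict


def mean_score_by_intensity(records):
    """Average scores per intensity label.

    `records` is an iterable of (intensity_label, score) pairs; this shape
    is an assumption made for illustration.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for intensity, score in records:
        totals[intensity] += score
        counts[intensity] += 1
    return {label: totals[label] / counts[label] for label in totals}


def relative_drop(reference, score):
    """Fractional drop of `score` relative to `reference` (positive = worse)."""
    if reference == 0:
        raise ValueError("reference score must be non-zero")
    return (reference - score) / reference
```

Computing `relative_drop` between the lowest and highest intensity levels, separately for reasoning and detection tasks, gives a direct way to compare how robust each task family is.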
Topics
- Romanized Code Mixing
- LLM Evaluation
- Indic Languages
- Multilingual NLP
- Instruction Following
- Benchmark Development
Best for: Research Scientist, AI Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.