Addressing Data Scarcity in Bangla Fake News Detection: An LLM-Based Dataset Augmentation Approach
Summary
This study introduces an LLM-based data augmentation framework for fake news detection in Bangla, a low-resource language, to address the limits of small, imbalanced datasets such as BanFakeNews. The researchers used the instruction-tuned Gemma-3-27B-IT model to generate 4,545 synthetic Bangla fake news articles. The framework applies semantic filtering and controlled subsampling to keep labels consistent and the data diverse. Experiments compared zero-shot and few-shot prompting, several augmentation rates, and random versus similarity-based selection strategies. The most effective configuration augmented only the minority fake news class, using zero-shot prompting, a high augmentation rate (K=5), and random subsampling. It raised the Fake News F1 score from 0.8560 to 0.8800, a gain of 0.024 (2.4 points). The generated dataset and implementation are publicly released to support further research.
Key takeaway
For AI engineers and research scientists working on natural language processing in under-resourced languages, this research shows that carefully controlled LLM-based data augmentation can measurably improve model performance: in the reported experiments, Fake News F1 rose from 0.8560 to 0.8800. Prioritize augmenting only the minority class, using zero-shot prompting, and applying random subsampling at higher augmentation rates to maximize linguistic diversity. This approach offers a practical way to work around data scarcity.
Key insights
LLM-based data augmentation improves fake news detection in a low-resource language by adding diverse synthetic examples of the minority class; the reported gain is 2.4 points of Fake News F1 (0.8560 to 0.8800).
Principles
- Targeted minority class oversampling is effective.
- Linguistic diversity in synthetic data enhances generalization.
- Zero-shot prompting can outperform few-shot in low-resource settings.
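The difference between zero-shot and few-shot prompting can be made concrete with a small prompt builder. This is only a sketch: the instruction wording, the `build_prompt` function, and the "Example N" layout are hypothetical, since this summary does not reproduce the prompts used in the study.

```python
from typing import Sequence

# Hypothetical instruction text; the study's actual prompt is not given here.
INSTRUCTION = (
    "Write one new Bangla news article in the style of the labeled fake news "
    "article below, for use as synthetic training data for a detector."
)


def build_prompt(article: str, examples: Sequence[str] = ()) -> str:
    """Build a generation prompt.

    With no `examples` the prompt is zero-shot (instruction + source article).
    With one or more `examples` it becomes few-shot: each demonstration is
    placed between the instruction and the source article.
    """
    parts = [INSTRUCTION]
    for i, example in enumerate(examples, start=1):
        parts.append(f"Example {i}:\n{example}")
    parts.append(f"Article:\n{article}")
    return "\n\n".join(parts)
```

Calling `build_prompt(article)` gives the zero-shot form, while `build_prompt(article, [demo_1, demo_2])` gives a two-shot form. One plausible reading of the zero-shot result is that demonstrations anchor the model to their wording and reduce variety, but that interpretation is not confirmed by this summary.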
Method
The proposed framework generates synthetic fake news with Gemma-3-27B-IT using zero-shot prompting, applies semantic filtering, and then randomly subsamples the generated articles at an augmentation rate of K=5 to augment only the minority fake news class.
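The minority-only augmentation step might look roughly like the sketch below. It makes several assumptions that are not stated in this summary: K is read as the maximum number of synthetic articles kept per original fake article, the candidates are assumed to be already generated and filtered, and the record layout (`text`, `label`, `synthetic`) and the `augment_minority` function are hypothetical.

```python
import random
from typing import Dict, List

FAKE, REAL = 1, 0  # hypothetical label encoding


def augment_minority(
    train: List[Dict],
    candidates: Dict[int, List[str]],
    k: int = 5,
    seed: int = 0,
) -> List[Dict]:
    """Return `train` plus up to `k` random synthetic articles per fake example.

    `candidates` maps the index of an original article in `train` to the
    synthetic articles generated from it (assumed to be filtered already).
    Entries that point at real (majority) articles are ignored, so only the
    minority class grows.
    """
    rng = random.Random(seed)  # fixed seed keeps the subsample reproducible
    augmented = list(train)    # copy so the original split is left untouched
    for idx, pool in candidates.items():
        if train[idx]["label"] != FAKE:
            continue
        chosen = rng.sample(pool, min(k, len(pool)))
        augmented.extend(
            {"text": text, "label": FAKE, "synthetic": True} for text in chosen
        )
    return augmented
```

Only the training split should be augmented this way; keeping synthetic articles out of validation and test data avoids measuring the detector on text the generator produced.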
In practice
- Augment only minority classes for imbalanced datasets.
- Prefer random subsampling over similarity-based selection to preserve diversity.
- Consider zero-shot prompting for LLM-based augmentation in low-resource NLP.
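The two selection strategies compared in the study can be contrasted with a toy example. This sketch assumes that similarity-based selection keeps the candidates closest to the source article, and it uses token-level Jaccard overlap as a stand-in for whatever semantic similarity measure the authors used; both choices, and all function names, are illustrative assumptions.

```python
import random
from typing import List


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1]; a simple stand-in for embeddings."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def select_random(pool: List[str], k: int, seed: int = 0) -> List[str]:
    """Keep up to k candidates uniformly at random, ignoring similarity."""
    rng = random.Random(seed)
    return rng.sample(pool, min(k, len(pool)))


def select_most_similar(source: str, pool: List[str], k: int) -> List[str]:
    """Keep the k candidates with the highest similarity to the source."""
    ranked = sorted(pool, key=lambda c: jaccard(source, c), reverse=True)
    return ranked[:k]
```

Under these assumptions, similarity-based selection tends to keep near-paraphrases of the source, whereas random selection samples the whole candidate pool, which fits the summary's point that random subsampling favors diversity.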
Topics
- Bangla Fake News Detection
- LLM Data Augmentation
- Gemma-3-27B-IT
- Zero-Shot Prompting
- Class Imbalance
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.