Addressing Data Scarcity in Bangla Fake News Detection: An LLM-Based Dataset Augmentation Approach

Source: cs.CL updates on arXiv.org · Field: Technology & Digital: Artificial Intelligence & Machine Learning, Data Science & Analytics, Natural Language Processing · Depth: Expert, extended

Summary

This study introduces an LLM-based data augmentation framework for fake news detection in Bangla, a low-resource language, to address the limitations of small and imbalanced datasets such as BanFakeNews. The researchers used the instruction-tuned Gemma-3-27B-IT model to generate 4,545 synthetic Bangla fake news articles. The framework applies semantic filtering and controlled subsampling to maintain label consistency and diversity. Experiments compared zero-shot and few-shot prompting, multiple augmentation rates, and random versus similarity-based selection strategies. The most effective configuration used zero-shot prompting, augmented only the minority fake news class, and applied random subsampling at a high augmentation rate (K=5). It raised the Fake News F1 score from 0.8560 to 0.8800, an absolute gain of 0.024 (2.4 points). The generated dataset and implementation are publicly released to support further research.
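For reference, a class-specific F1 score such as the Fake News F1 above is the harmonic mean of precision and recall computed for that one class. The following is a minimal sketch of that computation, not the authors' evaluation code; the function name `class_f1` and the label strings `"fake"` and `"real"` are illustrative assumptions.

```python
def class_f1(y_true, y_pred, positive="fake"):
    """F1 for a single class: harmonic mean of its precision and recall.

    `positive` names the class being scored (hypothetical label string).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined),
        # so F1 is conventionally reported as 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

On the figures reported in the summary, the absolute gain is 0.8800 - 0.8560 = 0.0240.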

Key takeaway

For AI engineers and research scientists working on natural language processing in under-resourced languages, this research shows that carefully controlled LLM-based data augmentation can measurably improve model performance: in the reported experiments, the Fake News F1 score rose from 0.8560 to 0.8800. The best-performing recipe augmented only the minority class, used zero-shot prompting, and applied random subsampling at a higher augmentation rate, which favors linguistic diversity in the added data. This approach offers a practical way to mitigate data scarcity.

Key insights

LLM-based data augmentation improved fake news detection in a low-resource language by adding diverse synthetic examples of the minority class; in the reported Bangla experiments, the Fake News F1 score rose by 2.4 points.

Method

The proposed framework generates synthetic Bangla news articles with Gemma-3-27B-IT using zero-shot prompting, applies semantic filtering, and then randomly subsamples the generated articles at an augmentation rate of K=5 to augment only the minority fake news class.
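The subsampling step can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: it assumes K is the maximum number of synthetic articles kept per original fake article, that generated candidates have already passed semantic filtering, and that they are keyed by the index of their source article. The names `augment_minority` and `candidates` and the label strings are hypothetical.

```python
import random


def augment_minority(train, candidates, k=5, minority_label="fake", seed=0):
    """Add up to k randomly chosen synthetic articles per minority example.

    train:      list of (text, label) pairs.
    candidates: dict mapping a training index to its list of filtered
                synthetic texts (hypothetical structure).
    Majority-class examples are copied through unchanged and never augmented.
    """
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    augmented = list(train)
    for idx, (_text, label) in enumerate(train):
        if label != minority_label:
            continue  # augment only the minority class
        pool = candidates.get(idx, [])
        # Random subsampling: draw without replacement, capped at k.
        picked = rng.sample(pool, min(k, len(pool)))
        augmented.extend((synthetic, minority_label) for synthetic in picked)
    return augmented
```

A similarity-based variant would replace the `rng.sample` call with a ranking of `pool` by similarity to the source article; the summary reports that random selection performed better in the best configuration.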

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.