Addressing Data Scarcity in Bangla Fake News Detection: An LLM-Based Dataset Augmentation Approach

2026-05-05 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Natural Language Processing · Depth: Expert, extended

Summary

This study introduces an LLM-based data augmentation framework to enhance fake news detection in Bangla, a low-resource language, addressing limitations of small and imbalanced datasets like BanFakeNews. Researchers used the instruction-tuned Gemma-3-27B-IT model to generate 4,545 synthetic Bangla fake news articles. The framework employs semantic filtering and controlled subsampling to maintain label consistency and diversity. Experiments compared zero-shot and few-shot prompting, multiple augmentation rates, and random versus similarity-based selection strategies. The most effective configuration involved augmenting only the minority fake news class with a high augmentation rate (K=5) and random subsampling using zero-shot prompting, which improved the Fake News F1 score from 0.8560 to 0.8800, a 2.4-point gain. The generated dataset and implementation are publicly released to support further research.

Key takeaway

For AI Engineers and Research Scientists working on natural language processing in under-resourced languages, this research shows that carefully controlled LLM-based data augmentation can measurably improve model performance: here, the Fake News F1 score rose from 0.8560 to 0.8800. The best results came from augmenting only the minority class, using zero-shot prompting, and applying random subsampling at a higher augmentation rate, a combination that favors linguistic diversity in the synthetic data. This approach offers a practical way to mitigate data scarcity.

Key insights

LLM-based data augmentation improved fake news detection in a low-resource language: adding diverse synthetic articles to the minority class raised the Fake News F1 score from 0.8560 to 0.8800.

Principles

Targeted minority class oversampling is effective.
Linguistic diversity in synthetic data enhances generalization.
Zero-shot prompting can outperform few-shot in low-resource settings.
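
The difference between the two prompting modes can be illustrated with a minimal sketch. The instruction wording, the function names, and the idea of conditioning on a seed article are all hypothetical; the summary does not reproduce the paper's actual prompts.

```python
from typing import Sequence, Tuple

# Hypothetical instruction text; the paper's real prompt is not given in this summary.
INSTRUCTION = "Write a new fake news article in Bangla on the same topic as the article below."


def build_zero_shot_prompt(seed_article: str) -> str:
    """Instruction plus the seed article only, with no demonstrations."""
    return f"{INSTRUCTION}\n\nArticle:\n{seed_article}\n\nNew article:"


def build_few_shot_prompt(
    seed_article: str, examples: Sequence[Tuple[str, str]]
) -> str:
    """Same instruction, preceded by (source, rewrite) demonstration pairs."""
    demos = "\n\n".join(
        f"Article:\n{src}\n\nNew article:\n{out}" for src, out in examples
    )
    return f"{INSTRUCTION}\n\n{demos}\n\nArticle:\n{seed_article}\n\nNew article:"
```

One plausible reading of the zero-shot result is that demonstrations anchor the model to the style of the few examples shown, which narrows the variety of generated text; the summary itself only reports the outcome.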

Method

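The proposed framework generates synthetic Bangla fake news with Gemma-3-27B-IT, applies semantic filtering to keep labels consistent, and then subsamples the generated articles. The best configuration used zero-shot prompting and random subsampling at an augmentation rate of K=5, augmenting only the minority fake news class.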
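
A minimal sketch of the subsampling and augmentation step follows. It assumes that K is the number of synthetic articles kept per original fake article, that the candidates have already passed semantic filtering, and that rows are dictionaries with `id`, `text`, and `label` keys; none of these details is confirmed by the summary, and `augment_minority` is a hypothetical name.

```python
import random
from typing import Dict, List


def augment_minority(
    train: List[Dict[str, str]],
    synthetic_by_seed: Dict[str, List[str]],
    k: int = 5,
    minority_label: str = "fake",
    seed: int = 0,
) -> List[Dict[str, str]]:
    """Return train plus up to k randomly chosen synthetic articles per minority row."""
    rng = random.Random(seed)  # fixed seed keeps the subsample reproducible
    augmented = list(train)
    for row in train:
        if row["label"] != minority_label:
            continue  # the majority class is left untouched
        # Assumption: these candidates were already semantically filtered.
        candidates = synthetic_by_seed.get(row["id"], [])
        chosen = rng.sample(candidates, min(k, len(candidates)))
        augmented.extend(
            {"id": f"{row['id']}-syn{i}", "text": text, "label": minority_label}
            for i, text in enumerate(chosen)
        )
    return augmented
```

Only the training split should be passed in, so that validation and test data stay free of synthetic text.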

In practice

Augment only minority classes for imbalanced datasets.
Prefer random subsampling over similarity-based selection to preserve diversity.
Consider zero-shot prompting for LLM-based augmentation in low-resource NLP.
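
The two selection strategies can be contrasted with a small sketch. It assumes each article is represented by an embedding vector and that similarity means cosine similarity to the seed article; the summary does not say how the paper measures similarity, and all function names here are hypothetical.

```python
import math
import random
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_similar(seed_vec: Vector, cand_vecs: Sequence[Vector], k: int) -> List[int]:
    """Indices of the k candidates most similar to the seed."""
    ranked = sorted(
        range(len(cand_vecs)),
        key=lambda i: cosine(seed_vec, cand_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


def select_random(n_candidates: int, k: int, seed: int = 0) -> List[int]:
    """Indices of up to k candidates drawn uniformly without replacement."""
    rng = random.Random(seed)
    return rng.sample(range(n_candidates), min(k, n_candidates))


def mean_pairwise_similarity(vecs: Sequence[Vector]) -> float:
    """Average cosine similarity over all pairs; lower means a more diverse set."""
    pairs = [(i, j) for i in range(len(vecs)) for j in range(i + 1, len(vecs))]
    if not pairs:
        return 0.0
    return sum(cosine(vecs[i], vecs[j]) for i, j in pairs) / len(pairs)
```

Comparing `mean_pairwise_similarity` over the two selections gives a rough check of diversity: similarity-based selection tends to pick near-duplicates of the seed, while random selection samples the full spread of candidates.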

Topics

Bangla Fake News Detection
LLM Data Augmentation
Gemma-3-27B-IT
Zero-Shot Prompting
Class Imbalance

Code references

phigratio/bangla-fake-news

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.