Bangla Key2Text: Text Generation from Keywords for a Low Resource Language
Summary
Bangla Key2Text is a large-scale dataset of 2.6 million Bangla keyword-text pairs, built to support keyword-driven text generation in Bangla, a low-resource language. The dataset was created by applying a BERT-based keyword extraction pipeline to millions of Bangla news articles, turning raw text into structured keyword-text pairs for supervised learning. The authors fine-tuned two sequence-to-sequence models, mT5 and BanglaT5, to establish baseline performance on the new benchmark. Evaluations with automatic metrics and human judgments showed that task-specific fine-tuning significantly improves keyword-conditioned text generation in Bangla and outperforms zero-shot large language models. The dataset, trained models, and code are publicly available to support further research on Bangla natural language generation and keyword-to-text tasks.
Key takeaway
For research scientists working on natural language generation in low-resource languages, Bangla Key2Text is a ready-made resource for training and benchmarking. Consider using its 2.6 million pairs to train and evaluate keyword-driven text generation models: in the reported evaluations, task-specific fine-tuning significantly outperformed zero-shot approaches. The dataset and its mT5 and BanglaT5 baselines give Bangla NLG work a supervised starting point to build on.
Key insights
A new dataset of 2.6 million Bangla keyword-text pairs enables supervised keyword-driven text generation in Bangla, a low-resource language.
Principles
- Task-specific fine-tuning improves keyword-conditioned generation over zero-shot large language models.
- BERT-based keyword extraction turns raw news articles into structured keyword-text pairs.
Method
A BERT-based pipeline extracts keywords from raw news articles to form keyword-text pairs, which are then used to fine-tune sequence-to-sequence models like mT5 and BanglaT5 for text generation.
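The summary does not say how the BERT-based pipeline scores keywords. As a minimal sketch, assuming a KeyBERT-style approach (rank candidate words by cosine similarity between their embedding and the whole-document embedding), pair construction could look like the following. `toy_embed` is a hypothetical character-trigram stand-in for a BERT encoder so the example runs without downloading a model; `extract_keywords` and `make_pair` are illustrative names, not the authors' code.

```python
import math
from collections import Counter
from typing import Callable, Dict, List

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Hypothetical stand-in for a BERT encoder: character-trigram counts."""
    padded = f" {text} "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    return {gram: float(n) for gram, n in counts.items()}


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def extract_keywords(
    text: str,
    embed: Callable[[str], Vector] = toy_embed,
    top_k: int = 3,
) -> List[str]:
    """Rank unique candidate words by similarity to the document embedding."""
    doc_vec = embed(text)
    # Strip common punctuation, including the Bangla danda, from each token.
    words = (w.strip(".,;:!?\u0964") for w in text.split())
    # Keep first occurrences only; skip very short tokens (assumed threshold).
    candidates = list(dict.fromkeys(w for w in words if len(w) > 2))
    ranked = sorted(candidates, key=lambda w: cosine(embed(w), doc_vec), reverse=True)
    return ranked[:top_k]


def make_pair(text: str, top_k: int = 3) -> Dict[str, object]:
    """Build one keyword-text pair from a raw sentence or article."""
    return {"keywords": extract_keywords(text, top_k=top_k), "text": text}


if __name__ == "__main__":
    print(make_pair("ঢাকা শহরে আজ ভারী বৃষ্টি হয়েছে।", top_k=2))
```

In a real pipeline, `embed` would be replaced by a function that returns sentence embeddings from a BERT-family model; the ranking logic stays the same.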
In practice
- Use Bangla Key2Text to train and benchmark Bangla keyword-to-text models.
- Apply a BERT-based keyword extractor to build keyword-text pairs from raw text.
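To make the points above concrete, here is a small sketch of how a keyword-text pair might be turned into a source/target example for a sequence-to-sequence model such as mT5 or BanglaT5, plus a simple sanity check on generated text. The comma-separated source format and the `keyword_coverage` check are assumptions for illustration; the summary does not specify the input format or the automatic metrics used.

```python
from typing import Dict, List


def to_seq2seq_example(keywords: List[str], text: str, sep: str = ", ") -> Dict[str, str]:
    """Format one pair: joined keywords are the model input, the text is the target.

    The separator is an assumed convention, not taken from the dataset.
    """
    return {"source": sep.join(keywords), "target": text}


def keyword_coverage(keywords: List[str], generated: str) -> float:
    """Fraction of keywords that appear verbatim in the generated text.

    An illustrative sanity check only; exact substring matching ignores inflection.
    """
    if not keywords:
        return 0.0
    hits = sum(1 for k in keywords if k in generated)
    return hits / len(keywords)


if __name__ == "__main__":
    example = to_seq2seq_example(["ঢাকা", "বৃষ্টি"], "ঢাকা শহরে আজ ভারী বৃষ্টি হয়েছে।")
    print(example["source"])
    print(keyword_coverage(["ঢাকা", "বৃষ্টি"], example["target"]))
```

The `source` and `target` strings would then be tokenized and passed to whichever fine-tuning loop is in use.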
Topics
- Bangla Key2Text
- Keyword-to-Text Generation
- Low-Resource Language NLP
- BERT-based Keyword Extraction
- mT5 and BanglaT5
Best for: Research Scientist, AI Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.