Bangla Key2Text: Text Generation from Keywords for a Low Resource Language

· Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Bangla Key2Text is a new large-scale dataset comprising 2.6 million Bangla keyword-text pairs, developed to facilitate keyword-driven text generation in Bangla, a low-resource language. The dataset was created by applying a BERT-based keyword extraction pipeline to millions of Bangla news articles, converting raw text into structured keyword-text pairs for supervised learning. Researchers fine-tuned two sequence-to-sequence models, mT5 and BanglaT5, to establish baseline performance on this new benchmark. Evaluations using automatic metrics and human judgments demonstrated that task-specific fine-tuning significantly enhances keyword-conditioned text generation in Bangla, outperforming zero-shot large language models. The dataset, along with the trained models and code, has been made publicly available to support further research in Bangla natural language generation and keyword-to-text tasks.

Key takeaway

For research scientists working on natural language generation in low-resource languages, Bangla Key2Text is a ready-made resource for training and benchmarking. Consider using its 2.6 million keyword-text pairs to train and evaluate keyword-driven text generation models: in the reported automatic and human evaluations, task-specific fine-tuning outperformed zero-shot large language models. The publicly released dataset, models, and code give a starting point for further work on Bangla NLG.

Key insights

A new dataset of 2.6 million Bangla keyword-text pairs enables supervised keyword-driven text generation in Bangla, a low-resource language.

Method

A BERT-based pipeline extracts keywords from Bangla news articles to form keyword-text pairs. These pairs are then used to fine-tune two sequence-to-sequence models, mT5 and BanglaT5, to generate text from keywords.
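
The pair-building step can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the summary only says the extraction is BERT-based, so the sketch assumes a similarity-ranking approach (score each candidate word against the whole text and keep the top few) and swaps the BERT encoder for a toy character-trigram embedding so that it runs without downloading a model. The names `toy_embed`, `extract_keywords`, and `make_pair`, and the `source`/`target` field names, are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Dict, List

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Toy stand-in for a BERT encoder (assumption): character-trigram counts."""
    padded = f" {text} "
    return dict(Counter(padded[i:i + 3] for i in range(len(padded) - 2)))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def extract_keywords(
    text: str,
    top_k: int = 3,
    embed: Callable[[str], Vector] = toy_embed,
) -> List[str]:
    """Rank unique whitespace-separated words by similarity to the full text."""
    doc_vec = embed(text)
    candidates = list(dict.fromkeys(text.split()))  # de-duplicate, keep order
    ranked = sorted(
        candidates,
        key=lambda word: cosine(embed(word), doc_vec),
        reverse=True,
    )
    return ranked[:top_k]


def make_pair(
    text: str,
    top_k: int = 3,
    embed: Callable[[str], Vector] = toy_embed,
) -> Dict[str, str]:
    """Build one keyword-text pair: keywords as model input, text as target."""
    keywords = extract_keywords(text, top_k, embed)
    return {"source": ", ".join(keywords), "target": text}


if __name__ == "__main__":
    # Example sentence, roughly "Heavy rain fell in Dhaka today".
    print(make_pair("ঢাকায় আজ ভারী বৃষ্টি হয়েছে", top_k=2))
```

In a real setup, `embed` would be replaced by a function that returns sentence vectors from a BERT-style encoder, and the resulting `source`/`target` strings would be tokenized and passed to a sequence-to-sequence model for fine-tuning.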

Best for: Research Scientist, AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.