A BART-based approach with hierarchical strategy for Vietnamese abstractive multi-document summarization
Summary
A BART-based hierarchical approach addresses the Vietnamese abstractive multi-document summarization task from VLSP 2022. The method first shortens each document using a strategy guided by the "golden summary" (the reference summary), so that the shortened text stays highly correlated with the target. It then concatenates the shortened documents and summarizes the result. It achieved a ROUGE2-F1 score of 0.2468 on the VLSP public test set and generated fluent, concise summaries. The authors also translated the Multinews dataset into Vietnamese, adding 50286 document clusters, which are now publicly available. Together these address two problems: the scarcity of Vietnamese multi-document summarization data, and the input length limits of models like BART, which become a constraint for clusters that typically run to 2220 tokens.
Key takeaway
NLP engineers building multi-document summarization systems for low-resource languages like Vietnamese should consider a hierarchical strategy: add an intermediate document-shortening step guided by the "golden summary" so that the shortened text stays correlated with the target summary. To overcome data scarcity, augment the training data by translating large English datasets such as Multinews. This method addresses the long input sequence problem and improves summary fluency and conciseness.
Key insights
A hierarchical, BART-based approach for Vietnamese multi-document summarization uses "golden summaries" to guide document shortening, which limits information loss in the intermediate step.
Principles
- Hierarchical summarization benefits from intermediate representations correlated with the final summary.
- Data scarcity in low-resource languages can be mitigated by translating large English datasets.
Method
The method is a two-step process: 1) Train a seq-to-seq model to select the top sentences from each document, guided by ROUGE1 scores against the golden summary. 2) Concatenate the selected sentences and feed them to a second seq-to-seq model, which generates the final summary.
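The ROUGE1-guided selection in step 1 can be illustrated with a minimal sketch. It is not the authors' implementation: it only shows how sentences might be ranked against the golden summary to build targets for the first model. The function names (`rouge1_f1`, `select_top_sentences`, `build_second_stage_input`), the whitespace tokenization, and the cutoff `k` are all assumptions made for illustration.

```python
from collections import Counter
from typing import List


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between two strings (assumed whitespace tokenization)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection counts each shared unigram at most min(count) times.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def select_top_sentences(sentences: List[str], golden_summary: str, k: int = 2) -> List[str]:
    """Keep the k sentences that score highest against the golden summary,
    returned in their original document order (k is a hypothetical cutoff)."""
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: rouge1_f1(sentences[i], golden_summary),
        reverse=True,
    )
    return [sentences[i] for i in sorted(ranked[:k])]


def build_second_stage_input(documents: List[List[str]], golden_summary: str, k: int = 2) -> str:
    """Concatenate the selected sentences of every document in the cluster."""
    selected: List[str] = []
    for sentences in documents:
        selected.extend(select_top_sentences(sentences, golden_summary, k))
    return " ".join(selected)
```

The golden summary is only available at training time, so a ranking like this can serve as a training target for the first model rather than as an inference-time step.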
In practice
- Translate large English datasets to augment low-resource language datasets.
- Use a two-step hierarchical model for long input sequences in multi-document summarization.
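The two-step flow can be sketched as a small pipeline, under stated assumptions. The `shorten` and `summarize` callables are hypothetical stand-ins for the two trained seq-to-seq models. The `max_tokens` default of 1024 and the whitespace token count are illustrative assumptions, not values taken from the article.

```python
from typing import Callable, List


def hierarchical_summarize(
    cluster: List[str],
    shorten: Callable[[str], str],
    summarize: Callable[[str], str],
    max_tokens: int = 1024,  # assumed input budget of the second model
) -> str:
    """Shorten each document, concatenate the results, then summarize once."""
    # Step 1: condense every document independently.
    shortened = [shorten(doc) for doc in cluster]
    # Step 2: merge the condensed documents into one input sequence.
    tokens = " ".join(shortened).split()
    # Safety truncation; whitespace tokens stand in for subword tokens here.
    if len(tokens) > max_tokens:
        tokens = tokens[:max_tokens]
    return summarize(" ".join(tokens))
```

Shortening each document before concatenation is what lets the merged input fit the second model's budget, so the final truncation acts only as a fallback.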
Topics
- Vietnamese NLP
- Multi-document Summarization
- Abstractive Summarization
- BARTPho
- Hierarchical Models
- Data Augmentation
- VLSP 2022
Best for: Research Scientist, AI Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.