A BART-based approach with hierarchical strategy for Vietnamese abstractive multi-document summarization

· Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Advanced, long

Summary

A novel BART-based hierarchical approach addresses the Vietnamese abstractive multi-document summarization task from VLSP 2022. The method first shortens each document using a strategy driven by the "golden summary", so that the retained content stays highly correlated with the target summary, then concatenates the shortened documents and summarizes the result. It achieved a ROUGE-2 F1 score of 0.2468 on the VLSP public test set and generated fluent, concise summaries. The authors also expanded the available data by translating the Multinews dataset into Vietnamese, adding 50,286 document clusters that are now publicly available. Together, these steps address the scarcity of Vietnamese multi-document summarization data and the input length limits of models like BART, which a typical cluster of 2220 tokens would otherwise strain.

Key takeaway

NLP engineers building multi-document summarization systems for low-resource languages like Vietnamese should consider a hierarchical strategy: add an intermediate document-shortening step guided by the "golden summary" so that the shortened input stays correlated with the target. To overcome data scarcity, augment the training data by translating large English datasets such as Multinews. According to the reported results, this approach handles long input sequences and produces fluent, concise summaries.

Key insights

A hierarchical, BART-based approach to Vietnamese multi-document summarization uses "golden summaries" to guide document shortening, which limits information loss when long inputs are cut down.

Method

A two-step process: 1) Train a seq-to-seq model to select the top sentences from each document, guided by ROUGE-1 scores against the golden summary. 2) Concatenate the selected sentences and feed them to a second seq-to-seq model, which generates the final summary.
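
The ROUGE-1-guided selection in step 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: it only shows how selection targets might be derived by ranking each document's sentences against the golden summary, and it does not train either seq-to-seq model. The function names, the whitespace tokenization, and the `top_k` parameter are all assumptions made for illustration.

```python
from collections import Counter
from typing import List


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between two strings (whitespace tokens; an assumption)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def shorten_document(sentences: List[str], golden_summary: str, top_k: int = 2) -> List[str]:
    """Keep the top_k sentences by ROUGE-1 F1, restored to document order."""
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: rouge1_f1(sentences[i], golden_summary),
        reverse=True,
    )
    return [sentences[i] for i in sorted(ranked[:top_k])]


def build_stage_two_input(cluster: List[List[str]], golden_summary: str, top_k: int = 2) -> str:
    """Concatenate the shortened documents into one string for the second model."""
    selected: List[str] = []
    for sentences in cluster:
        selected.extend(shorten_document(sentences, golden_summary, top_k))
    return " ".join(selected)
```

Because the golden summary is only available at training time, a ranking like this can serve as supervision for the selection model rather than as the selector itself. The value of `top_k` would in practice be chosen so that the concatenated text fits the second model's input limit.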

Best for: Research Scientist, AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.