Less is More: Quality-Aware Training Data Selection for Scientific Summarization

· Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing, Health & Medical Research · Depth: Advanced, medium

Summary

Less is More: Quality-Aware Training Data Selection for Scientific Summarization introduces an approach to improving long-document summarization by addressing dataset limitations and reference quality. The authors construct and release one of the largest biomedical and life science datasets, comprising 1.88 million PMC articles and designed for long-context models. They analyze author-written abstracts, which are commonly used as gold references, and find significant variability in how well those abstracts align with the full articles. Using source-grounded and model-based metrics, they show that these quality signals can guide training data selection. Training on high-quality subsets outperforms random sampling at equal size and matches or exceeds larger random subsets on factuality-oriented metrics, which points to better training efficiency and to the importance of reference quality in scientific summarization.

Key takeaway

If you are a machine learning engineer developing scientific summarization models, prioritize data quality over sheer volume. Instead of treating every author-written abstract as a gold standard, add a quality-aware selection step that keeps the references best aligned with their source articles. The paper's results indicate that this improves factuality and training efficiency, and that a selected subset can match or exceed a larger randomly sampled one on factuality-oriented metrics.

Key insights

Selecting high-quality reference summaries significantly enhances scientific summarization model training efficiency and factuality.

Principles

Method

The authors built a dataset of 1.88 million PMC articles, analyzed abstract quality with source-grounded and model-based metrics, and then used those quality signals to select high-quality training subsets for summarization models.
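
The selection step can be illustrated with a minimal Python sketch. The summary does not name the paper's metrics, so `grounding_score` below is a hypothetical stand-in (the fraction of abstract tokens that also appear in the article), and `tokenize`, `select_top_k`, and the three-item `corpus` are illustrative assumptions, not the authors' code or data.

```python
import re


def tokenize(text):
    """Lowercase alphanumeric tokens; a deliberately crude tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def grounding_score(article, abstract):
    """Fraction of abstract tokens that also occur in the article.

    Hypothetical proxy for a source-grounded metric; the paper's
    actual metrics are not specified in this summary.
    """
    abstract_tokens = tokenize(abstract)
    if not abstract_tokens:
        return 0.0
    article_vocab = set(tokenize(article))
    hits = sum(1 for tok in abstract_tokens if tok in article_vocab)
    return hits / len(abstract_tokens)


def select_top_k(pairs, k):
    """Return the k (article, abstract) pairs with the highest score."""
    ranked = sorted(
        pairs,
        key=lambda pair: grounding_score(pair[0], pair[1]),
        reverse=True,
    )
    return ranked[:k]


# Toy (article, abstract) pairs, invented for illustration only.
corpus = [
    ("Mice given drug A showed lower tumor growth after four weeks.",
     "Drug A lowered tumor growth in mice."),
    ("We sequenced soil bacteria from twelve alpine sites.",
     "Our revolutionary platform transforms global agriculture."),
    ("Sleep duration was measured in 300 adults using wrist actigraphy.",
     "Sleep duration in 300 adults was measured using actigraphy."),
]

# Keep the two best-grounded references as the training subset.
selected = select_top_k(corpus, 2)
for article, abstract in selected:
    print(round(grounding_score(article, abstract), 2), abstract)
```

A token-overlap proxy like this favors extractive abstracts and would penalize faithful paraphrase, which is one reason a real pipeline would likely combine it with model-based signals, as the summary says the authors did.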

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.