Less is More: Quality-Aware Training Data Selection for Scientific Summarization
Summary
This paper addresses two limitations in long-document summarization: dataset scale and reference quality. The authors constructed and released one of the largest biomedical and life science datasets, comprising 1.88 million PMC (PubMed Central) articles and designed for long-context models. They analyzed author-written abstracts, which are commonly used as gold references, and found significant variability in how well the abstracts align with the full articles. Using source-grounded and model-based metrics, they showed that these quality signals can guide training data selection. Training on high-quality subsets outperformed random sampling at equal sizes and matched or exceeded larger random subsets on factuality-oriented metrics, which points to improved training efficiency and to the importance of reference quality in scientific summarization.
Key takeaway
Machine Learning Engineers developing scientific summarization models should prioritize data quality over sheer volume. Instead of treating every author-written abstract as a gold standard, filter training data with quality-aware selection. Keeping the references that align closely with their source articles can improve a model's factuality and training efficiency, and may match the results of larger, randomly sampled datasets with less data.
Key insights
Selecting high-quality reference summaries improves both the training efficiency and the factuality of scientific summarization models.
Principles
- Author abstracts are not uniformly high-quality.
- Reference quality impacts summarization factuality.
- Data selection improves training efficiency.
Method
The authors built a dataset of 1.88 million PMC articles, scored the quality of each abstract with source-grounded and model-based metrics, and used those scores to select high-quality training subsets for summarization models.
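As a rough illustration of what a source-grounded signal can look like, the sketch below scores an abstract by the fraction of its content tokens that also appear in the article body. This is an assumed stand-in and not the metric used in the paper; the names `coverage_score`, `tokenize`, and the small `STOPWORDS` set are hypothetical.

```python
import re

# Tiny illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "are",
             "we", "that", "for", "on", "with"}


def tokenize(text):
    """Lowercase the text and return alphanumeric tokens, minus stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOPWORDS]


def coverage_score(abstract, article):
    """Fraction of the abstract's content tokens that occur in the article.

    A score near 1.0 means most of the abstract's wording is grounded in
    the source; a score near 0.0 means little of it can be found there.
    """
    abstract_tokens = tokenize(abstract)
    if not abstract_tokens:
        return 0.0
    article_vocab = set(tokenize(article))
    supported = sum(1 for t in abstract_tokens if t in article_vocab)
    return supported / len(abstract_tokens)
```

Lexical overlap of this kind is cheap to compute over millions of articles but only approximates alignment; a model-based metric would be needed to catch paraphrased or contradicted claims.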
In practice
- Filter summarization datasets by reference quality.
- Prioritize quality over raw data volume.
- Apply source-grounded quality metrics.
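The filtering step in the list above can be sketched as follows, assuming each training example already has a precomputed quality score. The helper names `select_top_k` and `select_random` are hypothetical, and the random baseline is included only so that both subsets can be compared at the same size.

```python
import random


def select_top_k(examples, scores, k):
    """Return the k examples with the highest quality scores.

    Ties are broken by original position so the result is deterministic.
    """
    if len(examples) != len(scores):
        raise ValueError("examples and scores must have the same length")
    order = sorted(range(len(examples)), key=lambda i: (-scores[i], i))
    return [examples[i] for i in order[:k]]


def select_random(examples, k, seed=0):
    """Return k examples sampled uniformly without replacement (baseline)."""
    rng = random.Random(seed)
    return rng.sample(examples, k)
```

Training one model on each subset at an equal budget `k` is a simple way to check whether a given quality score carries useful signal for a particular dataset.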
Topics
- Scientific Summarization
- Data Quality
- Training Data Selection
- Biomedical Datasets
- Long-Context Models
- Factuality Metrics
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.