EvoSelect: Data-Efficient LLM Evolution for Targeted Task Adaptation
Summary
EvoSelect is a framework for data-efficient adaptation of large language models (LLMs) to targeted tasks when the available synthetic data is noisy, redundant, or misaligned. It runs an iterative generation-selection-training loop in which a selection step precedes each model update. From the candidate samples produced by an external data generator, EvoSelect selects high-utility training data by jointly modeling targeted task alignment and diversity. Task relevance is estimated using optimal transport over proxy gradient representations, which quantifies alignment with the target task distribution, while a diversification mechanism mitigates redundancy. Experiments on 10 knowledge-intensive benchmarks, spanning scientific, commonsense, logical, and biomedical question answering, show that EvoSelect consistently improves adaptation efficacy over existing data selection methods, with both weaker and stronger data generators (Qwen2.5-3B-Instruct and Qwen2.5-14B-Instruct).
Key takeaway
For NLP engineers and research scientists adapting LLMs to specific tasks, EvoSelect offers a way to improve model performance while avoiding harmful adaptation, that is, training on synthetic data that leaves the model worse off. By combining optimal transport for task alignment with diversity regularization, teams can select high-quality synthetic data efficiently, even with limited resources. The result is more stable and effective LLM evolution, particularly on challenging tasks, with consistent gains over base models and traditional data selection strategies.
Key insights
EvoSelect efficiently adapts LLMs by selecting diverse, task-aligned synthetic data using optimal transport and diversity regularization.
Principles
- Jointly optimize task alignment and data diversity.
- Use optimal transport for comprehensive distribution alignment.
- Use proxy models to generate gradient representations efficiently.
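The last principle can be illustrated with a toy example. The sketch below computes per-sample gradient features from a small proxy and compresses them with a random projection. It is only an illustration under stated assumptions: the summary does not say which proxy model or projection EvoSelect uses, so the linear softmax classifier, the Gaussian projection, and the names `per_sample_gradient_features` and `proj_dim` are all hypothetical.

```python
import numpy as np

def softmax(z):
    # Numerically stable row-wise softmax.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def per_sample_gradient_features(X, y, W, proj_dim=16, seed=0):
    """Per-sample cross-entropy gradients of a linear softmax proxy,
    flattened and randomly projected to proj_dim dimensions.

    Assumption: a linear proxy stands in for the (unspecified) proxy model.
    """
    n, d = X.shape
    n_classes = W.shape[1]
    probs = softmax(X @ W)                    # (n, C) predicted probabilities
    onehot = np.eye(n_classes)[y]             # (n, C) one-hot labels
    # For sample i, d loss_i / dW = outer(x_i, p_i - y_i).
    grads = X[:, :, None] * (probs - onehot)[:, None, :]   # (n, d, C)
    flat = grads.reshape(n, d * n_classes)
    # Gaussian random projection keeps the features cheap to store and compare.
    rng = np.random.default_rng(seed)
    P = rng.normal(size=(d * n_classes, proj_dim)) / np.sqrt(proj_dim)
    return flat @ P
```

Because the proxy is much smaller than the target LLM, each feature vector costs one cheap forward and backward pass, which is the efficiency argument behind this principle.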
Method
EvoSelect uses an iterative optimization procedure to refine per-sample weights. Each step combines an optimal transport gradient, which captures task alignment, with a diversity gradient. Gradient features are collected with a proxy model to keep this step efficient. Once the weights are refined, the top-k samples are selected for training.
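The weight-refinement procedure can be sketched roughly as follows. This is a minimal illustration, not the paper's algorithm: the entropic (Sinkhorn) optimal transport solver, the cosine cost, the quadratic similarity penalty used as the diversity term, the exponentiated-gradient update, and all names and hyperparameters (`select_top_k`, `lam`, `lr`, `reg`) are assumptions that the summary does not specify.

```python
import numpy as np

def normalize_rows(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def sinkhorn_potential(cost, a, b, reg=0.1, n_iters=200):
    """Row dual potential of entropic OT between weights a and b.

    For entropic OT this potential acts as a gradient of the transport
    cost with respect to the row weights a, up to an additive constant.
    """
    K = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return reg * np.log(u)

def select_top_k(cand_feats, target_feats, k, lam=0.5, lr=0.5, n_steps=20):
    """Refine candidate weights with alignment and diversity gradients,
    then return the indices of the k highest-weighted candidates."""
    c = normalize_rows(cand_feats)
    t = normalize_rows(target_feats)
    cost = 1.0 - c @ t.T          # cosine cost: candidate vs. target features
    sim = c @ c.T                 # cosine similarity among candidates
    n, m = cost.shape
    w = np.full(n, 1.0 / n)       # candidate weights, kept on the simplex
    b = np.full(m, 1.0 / m)       # uniform weights over target samples
    for _ in range(n_steps):
        g_align = sinkhorn_potential(cost, w, b)   # high for poorly aligned samples
        g_div = 2.0 * sim @ w                      # gradient of the penalty w^T sim w
        grad = g_align + lam * g_div
        # Exponentiated-gradient step; centering only improves numerical stability.
        w = w * np.exp(-lr * (grad - grad.mean()))
        w = w / w.sum()
    return np.argsort(-w)[:k], w
```

In this sketch `lam` trades alignment against diversity: larger values shift weight away from near-duplicate candidates even when they match the target well.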
In practice
- Implement an iterative generation-selection-training loop.
- Use proxy models to reduce gradient computation costs.
- Measure the diversity of selected data with the Vendi Score when evaluating data selection methods.
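For the diversity check in the last item, the Vendi Score is the exponential of the Shannon entropy of the eigenvalues of a similarity matrix divided by the number of samples. The sketch below assumes a cosine-similarity kernel over feature vectors. The kernel choice and the function name `vendi_score` are illustrative and not taken from the paper.

```python
import numpy as np

def vendi_score(features):
    """Vendi Score of a set of feature vectors under a cosine kernel.

    Ranges from about 1 (all samples identical) up to n (all samples
    mutually orthogonal), so it reads as an effective number of distinct samples.
    """
    x = features / np.linalg.norm(features, axis=1, keepdims=True)
    K = x @ x.T                                  # similarity matrix with unit diagonal
    eigvals = np.linalg.eigvalsh(K / len(x))     # eigenvalues sum to 1
    eigvals = eigvals[eigvals > 1e-12]           # drop zeros before taking the log
    return float(np.exp(-np.sum(eigvals * np.log(eigvals))))
```

Comparing the score of a selected subset against that of a random subset of the same size gives a quick view of whether a selection method is collapsing onto redundant samples.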
Topics
- LLM Adaptation
- Data Selection
- Optimal Transport
- Data Diversity
- Task Alignment
Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.