EvoSelect: Data-Efficient LLM Evolution for Targeted Task Adaptation

2026-04-30 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

EvoSelect is a framework for data-efficient adaptation of large language models (LLMs) to targeted tasks when the available synthetic data is noisy, redundant, or misaligned. It runs an iterative generation-selection-training loop in which a selection step precedes every model update. From the candidate samples produced by an external data generator, EvoSelect selects high-utility training data by jointly modeling alignment with the target task and diversity. Task relevance is estimated with optimal transport over proxy gradient representations, which quantifies how well candidates align with the target task distribution, while a diversification mechanism mitigates redundancy. Experiments on 10 knowledge-intensive benchmarks, spanning scientific, commonsense, logical, and biomedical question answering, show that EvoSelect consistently improves adaptation efficacy over existing data selection methods with both a stronger and a weaker data generator (Qwen2.5-14B-Instruct and Qwen2.5-3B-Instruct, respectively).

Key takeaway

For NLP engineers and research scientists adapting LLMs to specific tasks, EvoSelect offers a way to improve model performance while avoiding harmful adaptation. By combining optimal transport for task alignment with diversity regularization, it lets teams select high-quality synthetic data efficiently, even with limited resources. The reported result is more stable and effective LLM evolution, particularly on challenging tasks, with consistent gains over base models and traditional data selection strategies.

Key insights

EvoSelect efficiently adapts LLMs by selecting diverse, task-aligned synthetic data using optimal transport and diversity regularization.

Principles

Jointly optimize task alignment and data diversity.
Use optimal transport for comprehensive distribution alignment.
Proxy models can efficiently generate gradient representations.
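
To make the third principle concrete, here is a minimal sketch of per-sample gradient features computed with a small proxy model. It assumes a softmax-regression proxy and a Gaussian random projection to keep the features low-dimensional; neither choice is taken from the paper, and `proxy_gradient_features`, `proj_dim`, and `seed` are hypothetical names used only for illustration.

```python
import numpy as np

def proxy_gradient_features(X, y, W, proj_dim=16, seed=0):
    """Per-sample gradients of a softmax-regression proxy, randomly projected.

    X: (n, d) inputs, y: (n,) integer labels, W: (d, c) proxy weights.
    Returns an (n, proj_dim) array of L2-normalised gradient features.
    Assumption: a linear proxy stands in for whatever proxy LLM is used.
    """
    n, d = X.shape
    c = W.shape[1]
    logits = X @ W
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    err = probs.copy()
    err[np.arange(n), y] -= 1.0                   # dLoss/dlogits for cross-entropy
    # Per-sample gradient w.r.t. W is the outer product x_i err_i^T, flattened.
    grads = (X[:, :, None] * err[:, None, :]).reshape(n, d * c)
    # Random projection keeps pairwise geometry approximately intact.
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((d * c, proj_dim)) / np.sqrt(proj_dim)
    feats = grads @ P
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    return feats / np.maximum(norms, 1e-12)
```

The point of the sketch is the cost profile: gradients come from a cheap model and are compressed before any pairwise comparison, so the selection step never touches the full model's parameters.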

Method

EvoSelect refines per-sample weights through an iterative optimization procedure that combines optimal transport gradients (for task alignment) with diversity gradients, then selects the top-k samples. A proxy model is used to collect the gradient features efficiently.
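
The weight-refinement idea can be sketched as follows. This is an illustration under stated assumptions, not the paper's algorithm: it uses entropic optimal transport solved with Sinkhorn iterations, a quadratic redundancy penalty w^T S w as the diversity term, and an exponentiated-gradient update on the weight simplex. The names `sinkhorn_potential`, `select_top_k`, `lam`, `lr`, and `eps` are hypothetical, and feature rows are assumed to be unit-normalised.

```python
import numpy as np

def sinkhorn_potential(a, b, C, eps=0.1, iters=200):
    """Entropic OT between weights a (n,) and b (m,) under cost C (n, m).

    Returns the centred source dual potential, which serves as the
    gradient of the regularised OT cost with respect to a.
    """
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    f = eps * np.log(u)
    return f - f.mean()

def select_top_k(cand, target, k, lam=0.5, lr=0.5, steps=20):
    """Refine candidate weights, then keep the k highest-weighted samples.

    cand: (n, p) candidate features, target: (m, p) target-task features.
    """
    n, m = len(cand), len(target)
    C = 1.0 - cand @ target.T        # cosine cost to the target samples
    S = cand @ cand.T                # candidate-candidate similarity
    w = np.full(n, 1.0 / n)          # start from uniform weights
    b = np.full(m, 1.0 / m)
    for _ in range(steps):
        g_align = sinkhorn_potential(w, b, C)   # high for off-target samples
        g_div = 2.0 * (S @ w)                   # high for redundant samples
        w = w * np.exp(-lr * (g_align + lam * g_div))
        w = np.maximum(w, 1e-12)                # keep log(u) finite
        w /= w.sum()
    return np.argsort(-w)[:k], w
```

In this sketch `lam` trades alignment against diversity: at `lam=0` selection follows transport cost alone, while larger values push weight away from near-duplicate candidates.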

In practice

Implement an iterative generation-selection-training loop.
Utilize proxy models to reduce gradient computation costs.
Evaluate data selection methods using Vendi Score for diversity.
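
For the last item, a minimal sketch of the Vendi Score: the exponential of the Shannon entropy of the eigenvalues of the normalised similarity matrix K/n. The cosine-similarity kernel over feature vectors is an assumption made here for illustration (the source does not say which kernel or embedding is used), and `vendi_score` is a hypothetical helper name. Rows are assumed to be non-zero.

```python
import numpy as np

def vendi_score(features):
    """Vendi Score of a set of feature vectors under a cosine kernel.

    Ranges from 1 (all samples identical) to n (all samples orthogonal).
    """
    X = np.asarray(features, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    K = X @ X.T                            # similarity matrix with K_ii = 1
    lam = np.linalg.eigvalsh(K / len(X))   # eigenvalues sum to 1
    lam = lam[lam > 1e-12]                 # treat 0 * log(0) as 0
    return float(np.exp(-np.sum(lam * np.log(lam))))
```

The score reads as an effective number of distinct samples, so comparing it across selected subsets of the same size gives a direct view of how much redundancy each selection method leaves in.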

Topics

LLM Adaptation
Data Selection
Optimal Transport
Data Diversity
Task Alignment

Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.