HARP: Efficient Data Selection for Finetuning Large Language Models
Summary
Hierarchical Active Region Pruning (HARP) is an efficient, train-based data selection method for finetuning large language models. It addresses the tension between selecting data that actually helps the downstream objective and the high cost of repeatedly finetuning a model to measure that help. Train-free selectors scale well but rely on proxies, while existing train-based methods require many costly train-evaluate iterations. HARP instead organizes the training data into a node-leaf hierarchy, evaluates only representative leaves, and infers the utilities of unmeasured leaves using empirical Bayes posteriors. It then selects data via one of two envelopes: HARP-C for conservative redundancy control, or HARP-E for additive complementary region rewards. The theoretical analysis shows that, under local smoothness and bounded estimation error, HARP controls selection error and reduces train-evaluate costs. HARP variants outperform strong baselines by up to 8.9 points while using approximately 7x fewer training examples.
Key takeaway
For machine learning engineers finetuning large language models, HARP offers a significant efficiency gain: up to 8.9 points better downstream performance than strong baselines while using approximately 7x fewer training examples. Consider integrating HARP's hierarchical data selection and empirical Bayes utility inference to streamline finetuning workflows and control compute costs. The approach supports more effective data curation without extensive train-evaluate cycles.
Key insights
HARP selects finetuning data for LLMs efficiently by evaluating only representative subsets of a data hierarchy and inferring the utilities of the data it did not evaluate.
Principles
- Balance data utility with selection cost.
- Hierarchical data organization reduces evaluation overhead.
- Empirical Bayes infers unmeasured data utility.
Method
HARP organizes data into a node-leaf hierarchy, evaluates representative leaves, infers unmeasured utilities with empirical Bayes, then selects data using HARP-C (redundancy control) or HARP-E (complementary rewards).
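The "evaluate a few leaves, infer the rest" step can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes a simple two-level hierarchy, a normal-normal empirical Bayes model, and a known measurement-noise variance. The function name `infer_leaf_utilities` and its parameters are hypothetical and chosen only for illustration.

```python
from statistics import mean, pvariance


def infer_leaf_utilities(hierarchy, measured, noise_var):
    """Estimate a utility for every leaf from a few measured leaves.

    hierarchy: dict mapping node id -> list of leaf ids (hypothetical layout)
    measured:  dict mapping leaf id -> utility from a train-evaluate run
    noise_var: assumed variance of the noise on a single measurement
    """
    # Average the measured (representative) leaves inside each node.
    node_obs = {}
    for node, leaves in hierarchy.items():
        scores = [measured[leaf] for leaf in leaves if leaf in measured]
        if scores:
            node_obs[node] = (mean(scores), len(scores))
    if not node_obs:
        raise ValueError("at least one leaf must be measured")

    # Estimate the prior from the data itself (the "empirical" part):
    # prior mean = mean of node averages, prior variance = spread of the
    # node averages minus the part explained by measurement noise.
    node_means = [m for m, _ in node_obs.values()]
    prior_mean = mean(node_means)
    avg_sampling_var = mean(noise_var / n for _, n in node_obs.values())
    prior_var = max(pvariance(node_means) - avg_sampling_var, 0.0)

    utilities = {}
    for node, leaves in hierarchy.items():
        if node in node_obs:
            node_mean, n = node_obs[node]
            denom = prior_var + noise_var / n
            # Shrinkage weight: trust the node's own data more when the
            # prior is wide or the node has many measurements.
            weight = prior_var / denom if denom > 0 else 1.0
            posterior = weight * node_mean + (1 - weight) * prior_mean
        else:
            # Nodes with no measured leaf fall back to the prior mean.
            posterior = prior_mean
        # Simplification: every leaf in a node shares the node posterior.
        for leaf in leaves:
            utilities[leaf] = posterior
    return utilities
```

For example, with nodes A, B, and C, measurements of 0.9 for one leaf in A and 0.1 for one leaf in B, and `noise_var=0.04`, the leaves of A are pulled toward the overall mean (about 0.8), the leaves of B are pulled up (about 0.2), and the unmeasured node C receives the prior mean (0.5). Only two train-evaluate runs are needed to score all leaves, which is the cost saving the hierarchy is meant to provide.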
In practice
- Apply HARP to reduce LLM finetuning costs.
- Use HARP-C for redundancy-controlled data selection.
- Use HARP-E for complementary data selection.
Topics
- Large Language Models
- Finetuning
- Data Selection
- HARP
- Machine Learning Efficiency
- Empirical Bayes
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.