HARP: Efficient Data Selection for Finetuning Large Language Models

2026-06-05 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Hierarchical Active Region Pruning (HARP) is a train-based data selection method for finetuning large language models. It targets the tension between selecting data that actually helps a downstream objective and the high cost of repeated finetuning. Train-free selectors scale well but rely on proxies, while existing train-based methods need many train-evaluate iterations. HARP instead organizes the training data into a node-leaf hierarchy, evaluates only representative leaves, and infers the utilities of unmeasured leaves with empirical Bayes posteriors. It then selects data through one of two envelopes: HARP-C, which applies conservative redundancy control, and HARP-E, which adds rewards for complementary regions. Under local smoothness and bounded estimation error, the method comes with theoretical control of selection error and a reduced train-evaluate cost. HARP variants outperform strong baselines by up to 8.9 points while using approximately 7x fewer training examples.

Key takeaway

For machine learning engineers finetuning large language models, HARP's main appeal is efficiency: the reported results show gains of up to 8.9 points over strong baselines with approximately 7x fewer training examples. Consider HARP's hierarchical data selection and empirical Bayes utility inference when repeated train-evaluate cycles dominate the cost of data curation.

Key insights

HARP selects finetuning data for LLMs efficiently by evaluating only representative leaves of a data hierarchy and inferring the utility of the rest.

Principles

Balance data utility with selection cost.
Hierarchical data organization reduces evaluation overhead.
Empirical Bayes infers unmeasured data utility.
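
The third principle can be illustrated with a standard normal-normal empirical Bayes shrinkage. This is a minimal sketch, not HARP's estimator: the summary does not specify the prior or likelihood, so the Gaussian model, the method-of-moments prior variance, the known noise variance, and the function names below are all assumptions made for illustration.

```python
from statistics import mean, pvariance

def shrink_toward_node_mean(measured, noise_var):
    """Shrink noisy leaf utilities toward their shared node mean.

    measured:  {leaf_id: observed utility} for the evaluated leaves of one node
    noise_var: assumed variance of the measurement noise (hypothetical input)
    """
    values = list(measured.values())
    mu = mean(values)
    # Method-of-moments prior variance: observed spread minus noise, floored at 0.
    tau2 = max(pvariance(values) - noise_var, 0.0)
    total = tau2 + noise_var
    # weight = 1 trusts the measurement fully, weight = 0 trusts only the node mean.
    weight = tau2 / total if total > 0 else 0.0
    posterior = {leaf: mu + weight * (u - mu) for leaf, u in measured.items()}
    return mu, tau2, posterior

def predict_unmeasured(mu, tau2):
    """An unmeasured leaf has no observation, so its estimate is the fitted prior."""
    return mu, tau2
```

For example, `shrink_toward_node_mean({"a": 0.2, "b": 0.4, "c": 0.6}, noise_var=0.01)` pulls the extreme leaves toward the node mean of 0.4, and any unmeasured sibling inherits that mean together with the fitted prior variance as its uncertainty.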

Method

HARP organizes data into a node-leaf hierarchy, evaluates representative leaves, infers unmeasured utilities with empirical Bayes, then selects data using HARP-C (redundancy control) or HARP-E (complementary rewards).
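
The first two steps of the method can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: the grouping functions `node_of` and `leaf_of`, the rule that the largest leaves act as representatives, and the `evaluate_leaf` callback (standing in for one finetune-and-evaluate run) are hypothetical, and a plain node average stands in for the empirical Bayes posterior.

```python
from collections import defaultdict

def build_hierarchy(examples, node_of, leaf_of):
    """Group example indices into node -> leaf -> [example indices]."""
    tree = defaultdict(lambda: defaultdict(list))
    for idx, example in enumerate(examples):
        tree[node_of(example)][leaf_of(example)].append(idx)
    return tree

def estimate_leaf_utilities(tree, evaluate_leaf, reps_per_node=1):
    """Measure a few representative leaves per node and fill in the rest.

    Returns ({(node, leaf): utility}, number of train-evaluate runs spent).
    """
    utilities, runs = {}, 0
    for node, leaves in tree.items():
        # Assumption: the largest leaves serve as the node's representatives.
        ranked = sorted(leaves, key=lambda leaf: len(leaves[leaf]), reverse=True)
        measured = {}
        for leaf in ranked[:reps_per_node]:
            measured[leaf] = evaluate_leaf(leaves[leaf])  # one costly run
            runs += 1
        node_mean = sum(measured.values()) / len(measured)
        for leaf in leaves:
            # Unmeasured leaves inherit the node-level estimate.
            utilities[(node, leaf)] = measured.get(leaf, node_mean)
    return utilities, runs
```

The point of the sketch is the cost accounting: the number of train-evaluate runs grows with the number of nodes times `reps_per_node`, not with the total number of leaves.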

In practice

Apply HARP to reduce LLM finetuning costs.
Use HARP-C for redundancy-controlled data selection.
Use HARP-E for complementary data selection.
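
To make the difference between the two selection styles concrete, here is a toy greedy selector. The summary does not give the envelope formulas, so the scoring rules below are illustrative stand-ins and not the definitions of HARP-C or HARP-E: the "conservative" mode uses a lower confidence bound with a penalty for repeated picks from the same node, and the "additive" mode adds a one-time bonus for a node that has not been covered yet. The parameters `kappa`, `penalty`, and `bonus` are hypothetical.

```python
def greedy_select(leaves, budget, mode, kappa=1.0, penalty=0.2, bonus=0.1):
    """Pick up to `budget` leaves greedily.

    leaves: {leaf_id: (node, posterior_mean, posterior_std)}
    mode:   "conservative" or "additive" (illustrative names, see lead-in)
    """
    chosen, picks_per_node = [], {}
    remaining = dict(leaves)

    def score(item):
        _, (node, mu, sd) = item
        seen = picks_per_node.get(node, 0)
        if mode == "conservative":
            # Lower confidence bound, reduced for each earlier pick in the same node.
            return (mu - kappa * sd) - penalty * seen
        # Additive: posterior mean plus a one-time bonus for an uncovered node.
        return mu + (bonus if seen == 0 else 0.0)

    while remaining and len(chosen) < budget:
        leaf_id, (node, _, _) = max(remaining.items(), key=score)
        chosen.append(leaf_id)
        picks_per_node[node] = picks_per_node.get(node, 0) + 1
        del remaining[leaf_id]
    return chosen
```

With a high-mean but uncertain leaf in a second node, the two modes diverge: the conservative rule prefers the leaf whose estimate is more certain, while the additive rule accepts the uncertain leaf because it opens a new region.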

Topics

Large Language Models
Finetuning
Data Selection
HARP
Machine Learning Efficiency
Empirical Bayes

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.