Mining Useful General Data for Low-Resource Domain Adaptation
Summary
NTK-Selector is a novel framework designed to enhance large language model (LLM) performance in low-resource domains by effectively selecting high-value auxiliary data from vast general-domain corpora. Addressing challenges like data scarcity and overfitting, the method leverages Neural Tangent Kernels (NTK) to identify relevant samples. It overcomes theoretical assumptions and computational costs of NTK for LLMs by empirically demonstrating stable NTK-like behavior during LoRA fine-tuning and introducing a scalable Jacobian-free approximation. The two-stage framework first uses embedding similarity for coarse-grained pre-selection, then applies fine-grained NTK scoring with LoRA-based gradient computation and random projection. Experiments across medical, financial, legal, and psychological domains show that while 1,000 in-domain samples yielded only +0.8 to +0.9 points for Llama3-8B-Instruct and Qwen3-8B, adding 9,000 NTK-selected auxiliary samples resulted in substantial gains of +8.7 and +5.1 points, representing 10.9x and 5.7x improvements respectively.
Key takeaway
For machine learning engineers adapting LLMs to low-resource domains, where in-domain data is scarce, you should integrate NTK-Selector. This framework reliably selects high-value auxiliary data, preventing overfitting and significantly boosting performance. Expect substantial gains, especially with very limited target data, as it compensates effectively. Optimize your computational resources by balancing pre-selection size and projection dimension for efficiency.
Key insights
Neural Tangent Kernels can effectively select high-value auxiliary data for low-resource LLM domain adaptation.
Principles
- LLMs under LoRA fine-tuning exhibit stable NTK-like behavior.
- NTK similarity predicts auxiliary data's impact on target performance.
- Auxiliary data quality outweighs quantity for performance gains.
Method
NTK-Selector employs a two-stage process: initial embedding-based pre-selection, followed by fine-grained scoring using a Jacobian-free NTK approximation with LoRA gradients and random projection.
In practice
- Apply NTK-Selector to augment limited in-domain LLM datasets.
- Use LoRA and random projection for efficient NTK computation.
- Optimize pre-selection size (4N-16N) and projection dimension (≥1024).
Topics
- Large Language Models
- Low-Resource Domains
- Domain Adaptation
- Neural Tangent Kernel
- Data Selection
- LoRA Fine-tuning
Code references
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.