Mining Useful General Data for Low-Resource Domain Adaptation

2026-06-08 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing, Data Science & Analytics · Depth: Expert, extended

Summary

NTK-Selector is a framework for improving large language model (LLM) performance in low-resource domains by selecting high-value auxiliary data from large general-domain corpora. To counter data scarcity and overfitting, it uses Neural Tangent Kernels (NTK) to identify general-domain samples that are relevant to the target domain. Two obstacles stand in the way of applying NTK to LLMs: the theory rests on assumptions that LLMs may not satisfy, and the kernel is expensive to compute. The authors address the first by showing empirically that LLMs exhibit stable NTK-like behavior during LoRA fine-tuning, and the second by introducing a scalable Jacobian-free approximation. The framework has two stages: coarse-grained pre-selection by embedding similarity, followed by fine-grained NTK scoring using LoRA-based gradient computation and random projection. In experiments across medical, financial, legal, and psychological domains, fine-tuning on 1,000 in-domain samples alone yielded only +0.8 and +0.9 points for Llama3-8B-Instruct and Qwen3-8B respectively. Adding 9,000 NTK-selected auxiliary samples raised the gains to +8.7 and +5.1 points, roughly 10.9x and 5.7x the in-domain-only improvements.

Key takeaway

If you are adapting LLMs to a domain where in-domain data is scarce, consider NTK-Selector for choosing auxiliary data from general-domain corpora. The selected samples compensate for limited target data and reduce the risk of overfitting, and the reported gains are largest when target data is very limited. To keep compute costs manageable, balance the pre-selection size against the projection dimension.

Key insights

Neural Tangent Kernels can effectively select high-value auxiliary data for low-resource LLM domain adaptation.

Principles

LLMs under LoRA fine-tuning exhibit stable NTK-like behavior.
NTK similarity predicts auxiliary data's impact on target performance.
Auxiliary data quality outweighs quantity for performance gains.
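
For readers who want the second principle in symbols, the following is the standard textbook definition of the empirical NTK together with the first-order argument for why gradient similarity predicts the effect of an auxiliary sample on a target sample. The notation (f, ℓ, θ, η) is chosen here for illustration and is not taken from the paper, whose exact scoring rule may differ.

```latex
% Empirical NTK between inputs x and x' for a scalar model output f(x; \theta)
K_\theta(x, x') = \nabla_\theta f(x;\theta)^\top \nabla_\theta f(x';\theta)

% First-order effect on the loss \ell at a target sample x of one gradient
% step of size \eta taken on an auxiliary sample x':
\ell\bigl(x;\, \theta - \eta \nabla_\theta \ell(x';\theta)\bigr) - \ell(x;\theta)
  \approx -\eta \, \nabla_\theta \ell(x;\theta)^\top \nabla_\theta \ell(x';\theta)
```

Under this approximation, an auxiliary sample whose gradient has a large positive inner product with the target gradients is expected to lower the target loss, which is the intuition behind scoring candidates by kernel similarity.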

Method

NTK-Selector employs a two-stage process: initial embedding-based pre-selection, followed by fine-grained scoring using a Jacobian-free NTK approximation with LoRA gradients and random projection.
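
The two-stage flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, gradients are passed in as plain arrays (in practice they would come from LoRA parameters), candidates are scored by a mean inner product with the target gradients, and the projection is a plain Gaussian matrix. The paper's Jacobian-free approximation is not reproduced here.

```python
import numpy as np


def preselect_by_embedding(target_emb, pool_emb, n_pre):
    """Stage 1 (coarse): keep the n_pre pool items whose embeddings have the
    highest mean cosine similarity to the target-domain embeddings."""
    t = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)
    p = pool_emb / np.linalg.norm(pool_emb, axis=1, keepdims=True)
    mean_sim = (p @ t.T).mean(axis=1)  # shape: (n_pool,)
    return np.argsort(-mean_sim)[:n_pre]


def projected_gradient_scores(target_grads, cand_grads, proj_dim, seed=0):
    """Stage 2 (fine): score candidates by the mean inner product between
    randomly projected candidate gradients and target gradients.
    Assumption: mean aggregation; the paper may aggregate differently."""
    rng = np.random.default_rng(seed)
    d = target_grads.shape[1]
    # Gaussian projection; inner products are preserved in expectation.
    proj = rng.standard_normal((d, proj_dim)) / np.sqrt(proj_dim)
    t = target_grads @ proj
    c = cand_grads @ proj
    return (c @ t.T).mean(axis=1)  # shape: (n_candidates,)


def select_auxiliary(target_emb, pool_emb, target_grads, pool_grad_fn,
                     n_pre, n_final, proj_dim=1024):
    """Run both stages and return indices into the original pool.
    pool_grad_fn is a hypothetical callback mapping pool indices to a
    (len(indices), d) array of per-sample gradients."""
    pre_idx = preselect_by_embedding(target_emb, pool_emb, n_pre)
    scores = projected_gradient_scores(
        target_grads, pool_grad_fn(pre_idx), proj_dim
    )
    return pre_idx[np.argsort(-scores)[:n_final]]
```

The point of the split is that per-sample gradients are only computed for the pre-selected candidates, so the expensive stage never touches the full general-domain corpus.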

In practice

Apply NTK-Selector to augment limited in-domain LLM datasets.
Use LoRA and random projection for efficient NTK computation.
Tune the pre-selection size (4N to 16N) and the projection dimension (≥1024).
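
One way to keep these settings inside the recommended ranges is a small validated config object. The sketch below is hypothetical: the class and field names are invented for illustration, and it assumes N denotes the number of auxiliary samples to select, which this summary does not define.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SelectionBudget:
    """Hypothetical helper that checks the ranges quoted in the summary."""
    n: int                   # assumed meaning of N: auxiliary samples to select
    pre_multiplier: int = 8  # pre-selection pool size = pre_multiplier * n
    proj_dim: int = 1024     # random projection dimension

    def __post_init__(self):
        if self.n <= 0:
            raise ValueError("n must be positive")
        if not 4 <= self.pre_multiplier <= 16:
            raise ValueError("pre_multiplier should lie in [4, 16]")
        if self.proj_dim < 1024:
            raise ValueError("proj_dim should be at least 1024")

    @property
    def pre_selection_size(self) -> int:
        """Number of candidates passed to the fine-grained scoring stage."""
        return self.pre_multiplier * self.n

    def projected_storage_bytes(self, bytes_per_value: int = 4) -> int:
        """Memory needed to hold one projected gradient per candidate."""
        return self.pre_selection_size * self.proj_dim * bytes_per_value


budget = SelectionBudget(n=9000)
print(budget.pre_selection_size)         # 72000
print(budget.projected_storage_bytes())  # 294912000
```

Both knobs scale the second-stage cost linearly, which is why the trade-off between them matters: a larger pool means more gradient computations, and a larger projection dimension means more memory per candidate.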

Topics

Large Language Models
Low-Resource Domains
Domain Adaptation
Neural Tangent Kernel
Data Selection
LoRA Fine-tuning

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.