Applications of data valuation in machine learning
Summary
Data valuation quantifies the contribution of individual training samples to a model's performance, with applications in data engineering, model development, and data markets. In data engineering, it helps repair or prune corrupt data by flagging low-value samples, which can reduce test error and give insight into data quality. It also helps prune superfluous data, for example with TS-DShapley for LLMs, to cut computational cost and identify redundant information. For active learning, data valuation, particularly with Shapley values or influence functions, can select batches of unlabeled data for annotation, potentially reducing annotation costs by 10-30%. In model development, it supports interpreting and debugging models by tracing the effect of specific training samples on predictions, and it enables sensitivity analysis to assess model robustness. Data valuation can also detect data manipulation, such as watermark removal and poisoning attacks, and it is central to establishing fair pricing in two-sided data markets that connect data providers and consumers.
Key takeaway
For AI engineers focused on model performance and data efficiency, data valuation is worth understanding. Consider integrating methods such as Shapley values or influence functions into your data pipelines to identify and remove harmful or redundant data, which can improve model robustness and reduce training costs. The same methods can drive active learning, making data annotation more targeted and economical.
Key insights
Data valuation quantifies individual training sample contributions, enabling data quality improvement, model debugging, and fair data pricing.
Principles
- Low-value data often indicates corruption or redundancy.
- Data value can guide efficient data acquisition.
- A model's sensitivity to the removal of training data reveals intrinsic properties of the problem.
Method
Data valuation typically involves computing a score for each training sample based on its contribution to model performance, then using these scores to rank, prune, or select data for various tasks.
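The score-then-rank workflow described above can be sketched with leave-one-out valuation, one of the simplest scoring schemes. This is a minimal illustration under stated assumptions, not the method of the original article: the toy 1-D dataset, the 1-nearest-neighbor model, and every function name are hypothetical choices made for the example.

```python
def nearest_neighbor_predict(train, x):
    """Return the label of the training point closest to x (1-D features)."""
    _, label = min(train, key=lambda sample: abs(sample[0] - x))
    return label


def accuracy(train, test):
    """Utility: test accuracy of a 1-nearest-neighbor classifier using `train`."""
    if not train:
        return 0.0  # convention: an empty training set has zero utility
    hits = sum(nearest_neighbor_predict(train, x) == y for x, y in test)
    return hits / len(test)


def leave_one_out_values(train, test):
    """Value of sample i = utility with all data minus utility without sample i."""
    full = accuracy(train, test)
    return [full - accuracy(train[:i] + train[i + 1:], test)
            for i in range(len(train))]


# Toy (feature, label) pairs; the last training sample is deliberately mislabeled.
train = [(0.0, 0), (1.0, 0), (4.0, 1), (5.0, 1), (4.6, 0)]
test = [(0.5, 0), (4.5, 1), (4.7, 1), (5.2, 1)]

values = leave_one_out_values(train, test)
ranking = sorted(range(len(train)), key=lambda i: values[i])  # lowest value first
print(values)   # [0.0, 0.0, 0.0, 0.25, -0.5]
print(ranking)  # [4, 0, 1, 2, 3]
```

The mislabeled sample gets the only negative score, so it ranks first for pruning. Several clean samples score exactly zero because a neighbor with the same label covers for them when they are removed. This is a known limitation of leave-one-out and one reason Shapley-style scores are often preferred.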
In practice
- Discard the lowest-valued samples, which can improve model performance.
- Use influence functions to prioritize label fixing for erroneous predictions.
- Apply TS-DShapley for efficient LLM fine-tuning data selection.
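As a rough illustration of the first practice in the list above, the sketch below computes exact data Shapley values on a five-sample toy set and discards the lowest-valued sample. It is not TS-DShapley and not code from the referenced repositories. Enumerating every ordering is only feasible for a handful of points, which is why practical methods rely on sampling or other approximations. The dataset, the 1-nearest-neighbor utility, and all names are assumptions made for the example.

```python
from itertools import permutations


def utility(train, test):
    """Test accuracy of a 1-nearest-neighbor classifier; 0.0 for an empty set."""
    if not train:
        return 0.0
    hits = 0
    for x, y in test:
        _, predicted = min(train, key=lambda sample: abs(sample[0] - x))
        hits += predicted == y
    return hits / len(test)


def exact_shapley_values(train, test):
    """Average each sample's marginal contribution over every ordering of the data."""
    n = len(train)
    totals = [0.0] * n
    count = 0
    for order in permutations(range(n)):
        subset, previous = [], 0.0
        for i in order:
            subset.append(train[i])
            current = utility(subset, test)
            totals[i] += current - previous  # marginal gain from adding sample i
            previous = current
        count += 1
    return [total / count for total in totals]


# Toy (feature, label) pairs; the last training sample is deliberately mislabeled.
train = [(0.0, 0), (1.0, 0), (4.0, 1), (5.0, 1), (4.6, 0)]
test = [(0.5, 0), (4.5, 1), (4.7, 1), (5.2, 1)]

values = exact_shapley_values(train, test)
worst = min(range(len(train)), key=lambda i: values[i])
pruned = [sample for i, sample in enumerate(train) if i != worst]
print(worst, utility(train, test), utility(pruned, test))  # 4 0.5 1.0
```

In this toy setting the mislabeled sample is the only one with a negative Shapley value, and dropping it raises test accuracy from 0.5 to 1.0. On real data the gain from pruning is not guaranteed and should be checked on a held-out set.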
Topics
- Data Valuation
- Data Engineering
- Model Debugging
- Data Marketplaces
- Shapley Values
Code references
- rgiordan/zaminfluence
- amiratag/DataShapley
- AIsecure/Shapley-Study
- ruoxi-jia-group/LAVA
- ykwon0407/dataoob
Best for: AI Engineer, NLP Engineer, Computer Vision Engineer, Machine Learning Engineer, Data Scientist, AI Researcher
Editorial summary, takeaway, and curation by AIssential. Original article published by Blog of the TransferLab — appliedAI Institute.