Applications of data valuation in machine learning

2023-11-20 · Source: Blog of the TransferLab — appliedAI Institute · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Cybersecurity & Data Privacy · Depth: Advanced, extended

Summary

Data valuation quantifies the contribution of individual training samples to a model's performance, and it has applications across data engineering, model development, and data markets. In data engineering, low-value samples often point to corrupt data; repairing or pruning them can reduce test error and gives insight into data quality. Valuation also helps prune superfluous data, for example with TS-DShapley for LLM fine-tuning, which lowers computational cost and exposes redundant information. In active learning, data valuation based on Shapley values or influence functions can select batches of unlabeled data for annotation, potentially reducing annotation costs by 10-30%. In model development, it supports interpreting and debugging models by tracing the impact of specific training samples on predictions, and sensitivity analysis shows how robust a model is to changes in its training data. Finally, data valuation can help detect data manipulation, such as watermark removal and poisoning attacks, and it underpins fair pricing in two-sided data markets that connect data providers with data consumers.

Key takeaway

For AI engineers focused on model performance and data efficiency, data valuation is worth understanding. Consider integrating methods such as Shapley values or influence functions into your data pipelines to identify and remove harmful or redundant data, which can improve model robustness and reduce training costs. The same methods can drive active learning strategies, making data annotation efforts more targeted and economically viable.

Key insights

Data valuation quantifies individual training sample contributions, enabling data quality improvement, model debugging, and fair data pricing.

Principles

Low-value data often indicates corruption or redundancy.
Data value can guide efficient data acquisition.
How sensitive a model is to data removal reveals properties of the underlying problem, such as how robust the learned solution is.

Method

Data valuation typically involves computing a score for each training sample based on its contribution to model performance, then using these scores to rank, prune, or select data for various tasks.
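The score-then-rank loop described above can be illustrated with leave-one-out valuation, one of the simplest valuation schemes. The following is a minimal sketch, not the method of the original article: the 1-nearest-neighbour model, the 1-D toy data, and the names `accuracy` and `leave_one_out_values` are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, int]  # (feature, label); hypothetical 1-D toy format


def accuracy(train: Sequence[Sample], val: Sequence[Sample]) -> float:
    """Validation accuracy of a 1-nearest-neighbour classifier fit on `train`."""
    if not train:
        return 0.0
    hits = 0
    for x, y in val:
        # Predict the label of the training sample with the closest feature.
        _, predicted = min(train, key=lambda s: abs(s[0] - x))
        hits += predicted == y
    return hits / len(val)


def leave_one_out_values(train: List[Sample], val: Sequence[Sample]) -> List[float]:
    """Value of sample i = accuracy on all data minus accuracy without sample i."""
    full = accuracy(train, val)
    return [full - accuracy(train[:i] + train[i + 1:], val)
            for i in range(len(train))]


# Toy data (assumed): the last training sample carries a wrong label.
train = [(0.0, 0), (1.0, 0), (4.0, 1), (5.0, 1), (0.5, 1)]
val = [(0.2, 0), (0.6, 0), (0.9, 0), (4.5, 1), (4.8, 1)]

values = leave_one_out_values(train, val)
# Indices sorted from lowest to highest value, ready for pruning or inspection.
ranking = sorted(range(len(train)), key=lambda i: values[i])
print(ranking, values)
```

On this toy set the mislabelled sample receives a negative value and lands first in `ranking`, which is the kind of signal used to flag candidates for repair or removal. Leave-one-out needs one retraining per sample, so real pipelines usually rely on cheaper approximations.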

In practice

Discard lowest-valued samples to improve model performance.
Use influence functions to prioritize which training labels to fix when the model makes erroneous predictions.
Apply TS-DShapley for efficient LLM fine-tuning data selection.
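The first item in this list, discarding the lowest-valued samples, can be sketched with a plain Monte Carlo permutation estimate of Shapley values. This is deliberately not TS-DShapley or an influence-function method; it is a small stand-in under stated assumptions. The 1-nearest-neighbour utility, the toy data, and the names `utility`, `monte_carlo_shapley`, and `prune_lowest` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Sample = Tuple[float, int]  # (feature, label); hypothetical 1-D toy format


def utility(subset: Sequence[int], train: Sequence[Sample],
            val: Sequence[Sample]) -> float:
    """Validation accuracy of 1-NN using only the training indices in `subset`."""
    if not subset:
        return 0.0
    hits = 0
    for x, y in val:
        nearest = min(subset, key=lambda i: abs(train[i][0] - x))
        hits += train[nearest][1] == y
    return hits / len(val)


def monte_carlo_shapley(train: Sequence[Sample], val: Sequence[Sample],
                        n_permutations: int = 200, seed: int = 0) -> List[float]:
    """Average marginal contribution of each sample over random permutations."""
    rng = random.Random(seed)
    indices = list(range(len(train)))
    totals = [0.0] * len(train)
    for _ in range(n_permutations):
        rng.shuffle(indices)
        prefix: List[int] = []
        previous = 0.0  # utility of the empty set
        for i in indices:
            prefix.append(i)
            current = utility(prefix, train, val)
            totals[i] += current - previous
            previous = current
    return [t / n_permutations for t in totals]


def prune_lowest(train: Sequence[Sample], values: Sequence[float],
                 k: int) -> List[Sample]:
    """Drop the k lowest-valued samples, keeping the original order."""
    dropped = set(sorted(range(len(train)), key=lambda i: values[i])[:k])
    return [s for i, s in enumerate(train) if i not in dropped]


# Toy data (assumed): the last training sample carries a wrong label.
train = [(0.0, 0), (1.0, 0), (4.0, 1), (5.0, 1), (0.5, 1)]
val = [(0.2, 0), (0.6, 0), (0.9, 0), (4.5, 1), (4.8, 1)]

values = monte_carlo_shapley(train, val)
pruned = prune_lowest(train, values, k=1)
before = utility(list(range(len(train))), train, val)
after = utility(list(range(len(pruned))), pruned, val)
print(f"accuracy before pruning: {before:.2f}, after: {after:.2f}")
```

Comparing `before` and `after` on held-out data is the relevant check: pruning is only worthwhile if performance holds or improves. The number of permutations trades estimate quality against cost, since every marginal contribution requires one utility evaluation.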

Topics

Data Valuation
Data Engineering
Model Debugging
Data Marketplaces
Shapley Values

Best for: AI Engineer, NLP Engineer, Computer Vision Engineer, Machine Learning Engineer, Data Scientist, AI Researcher

Editorial summary, takeaway, and curation by AIssential. Original article published by Blog of the TransferLab — appliedAI Institute.