DataShield: Safety-degrading Data Filtering for LLM Benign Instruction Fine-Tuning
Summary
DataShield is a method for efficiently identifying safety-degrading samples within benign datasets used to fine-tune large language models (LLMs). Existing approaches to this problem are often computationally expensive and noisy. DataShield's core intuition is that benign fine-tuning makes an LLM more compliant in its responses overall, so it quantifies each sample's contribution to this compliance behavior as a safety degradation score. The method has two stages. First, Compliance Vector Extraction derives a compliance direction, and a Compliance-Aware Score (CAS) selects the optimal safety-critical layer. Second, a Safety-degrading Sample Filtering component measures how far each training sample shifts the model's projection along that compliance direction. In evaluations on Llama3-8B, Llama3.1-8B, and Qwen2.5-7B with the Alpaca and Dolly datasets, DataShield separates high-risk from low-risk data subsets. The research also notes that open-ended question-answering samples often trigger safety degradation and are associated with longer responses.
Key takeaway
ML engineers fine-tuning LLMs on benign instruction datasets should integrate a data filtering method such as DataShield into their pipeline. Pinpointing high-risk samples before deployment helps mitigate safety degradation and preserve the model's safety behavior. They should also prioritize safety reviews for open-ended question-answering data, which is more prone to triggering safety degradation.
Key insights
DataShield efficiently identifies safety-degrading samples in benign LLM fine-tuning datasets by quantifying each sample's contribution to compliance behavior.
Principles
- Benign fine-tuning increases LLM compliance.
- Quantify sample contribution to compliance.
- Open-ended Q&A often degrades safety.
Method
DataShield extracts compliance vectors, uses a Compliance-Aware Score (CAS) to find the optimal safety-critical layer, then filters samples by quantifying projection shift along the compliance direction.
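The projection-shift idea can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes the compliance direction is a difference of mean hidden activations between complied and refused prompts at one already-chosen layer, and that a sample's score is the change in its activation projected onto that direction. The function names and toy arrays are hypothetical, and CAS layer selection is omitted because this summary does not define it.

```python
import numpy as np


def compliance_direction(complied_acts: np.ndarray, refused_acts: np.ndarray) -> np.ndarray:
    """Unit vector pointing from the mean 'refused' activation to the mean
    'complied' activation (difference of means; an assumed construction)."""
    diff = complied_acts.mean(axis=0) - refused_acts.mean(axis=0)
    norm = np.linalg.norm(diff)
    if norm == 0:
        raise ValueError("complied and refused activations have identical means")
    return diff / norm


def projection_shift(acts_before: np.ndarray, acts_after: np.ndarray,
                     direction: np.ndarray) -> np.ndarray:
    """Per-sample change in projection along the compliance direction.
    Larger positive values mean a larger shift toward compliance."""
    return (acts_after - acts_before) @ direction


# Toy 2-D activations standing in for hidden states at one layer (hypothetical data).
complied = np.array([[2.0, 0.0], [4.0, 0.0]])
refused = np.array([[0.0, 0.0], [0.0, 0.0]])
direction = compliance_direction(complied, refused)

before = np.array([[0.0, 0.0], [1.0, 1.0]])
after = np.array([[0.5, 3.0], [1.0, 1.0]])
shifts = projection_shift(before, after, direction)
print(direction, shifts)
```

In this toy case only movement along the first axis counts toward the score, so the first sample's large change along the second axis is ignored.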
In practice
- Filter benign datasets for safety degradation.
- Identify high-risk and low-risk data subsets.
- Prioritize safety review for open-ended Q&A.
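The filtering step above can be sketched as a simple split on per-sample safety degradation scores. This summary does not say how DataShield draws the line between high-risk and low-risk subsets, so the top-fraction cutoff here is an assumption, and `split_by_risk`, `high_risk_fraction`, and the sample data are hypothetical names used only for illustration.

```python
from typing import List, Sequence, Tuple


def split_by_risk(samples: Sequence[str], scores: Sequence[float],
                  high_risk_fraction: float = 0.1) -> Tuple[List[str], List[str]]:
    """Split samples into (high_risk, low_risk) by safety degradation score.
    The highest-scoring fraction is treated as high-risk (assumed cutoff rule)."""
    if len(samples) != len(scores):
        raise ValueError("samples and scores must have the same length")
    if not 0.0 <= high_risk_fraction <= 1.0:
        raise ValueError("high_risk_fraction must be between 0 and 1")
    # Indices sorted from highest to lowest score.
    order = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    n_high = int(len(samples) * high_risk_fraction)
    high_idx = set(order[:n_high])
    # Preserve the original dataset order within each subset.
    high = [s for i, s in enumerate(samples) if i in high_idx]
    low = [s for i, s in enumerate(samples) if i not in high_idx]
    return high, low


samples = ["a", "b", "c", "d"]
scores = [0.1, 0.9, 0.3, 0.2]
high_risk, low_risk = split_by_risk(samples, scores, high_risk_fraction=0.25)
print(high_risk, low_risk)
```

The high-risk subset can then be dropped from training or routed to manual safety review, depending on how much data the pipeline can afford to lose.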
Topics
- LLM Safety
- Data Filtering
- Instruction Fine-Tuning
- Compliance-Aware Score
- Llama3
- Qwen2.5
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.