GradShield: Alignment Preserving Finetuning
Summary
GradShield is a filtering method designed to prevent Large Language Models (LLMs) from becoming misaligned during finetuning, addressing risks from both explicitly and implicitly harmful data. The method computes a Finetuning Implicit Harmfulness Score (FIHS) for each data point, quantifying its potential to degrade safety alignment. It then applies an adaptive thresholding algorithm to remove high-FIHS data points from the finetuning dataset. In the reported experiments, GradShield keeps the Attack Success Rate (ASR) below 6% across a range of utility finetuning tasks and harmful data ratios, outperforming baseline methods while preserving utility performance. The approach is computationally efficient: calculating FIHS costs approximately one epoch of finetuning. It also generalizes across LLMs, including Llama-3.1-8B-Instruct, Llama-2-7B-chat, and Qwen2.5-7B-Instruct.
Key takeaway
For AI engineers and model providers offering finetuning services, GradShield provides a robust defense against safety misalignment. You should integrate GradShield into your finetuning pipelines to filter harmful data before training, so that models remain aligned and safe for deployment without compromising their task-specific utility. This is especially critical when dealing with diverse, user-provided datasets where implicit harmfulness is a concern.
Key insights
GradShield filters finetuning data using a gradient-based score to preserve LLM safety alignment without sacrificing utility.
Principles
- Implicit harmfulness can degrade LLM safety.
- Safety alignment is brittle during finetuning.
- Gradient dot product quantifies data point harmfulness.
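One way to read the gradient dot product principle is through a first-order argument: a single gradient step on a data point changes a safety loss by roughly the negative learning rate times the dot product of the two gradients. The sketch below illustrates this on a toy two-parameter linear model with a squared-error loss. The model, the loss, and all names (`harm_score`, `safety_example`, and so on) are assumptions made for illustration; this is not the paper's FIHS definition.

```python
def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def squared_error(w, x, y):
    """Loss 0.5 * (w.x - y)^2 for a linear model with weights w."""
    return 0.5 * (dot(w, x) - y) ** 2


def squared_error_grad(w, x, y):
    """Gradient of the loss above with respect to w."""
    residual = dot(w, x) - y
    return [residual * xi for xi in x]


def harm_score(w, candidate, safety_example):
    """Hypothetical score: first-order estimate of how much one SGD step
    on `candidate` raises the safety loss, per unit learning rate."""
    g_candidate = squared_error_grad(w, *candidate)
    g_safety = squared_error_grad(w, *safety_example)
    return -dot(g_safety, g_candidate)


def sgd_step(w, example, lr):
    """One plain SGD step on a single example."""
    g = squared_error_grad(w, *example)
    return [wi - lr * gi for wi, gi in zip(w, g)]


w = [0.5, 0.5]
safety_example = ([1.0, 0.0], 1.0)   # stands in for a safety-aligned target
aligned = ([1.0, 0.0], 1.0)          # pulls the model the same way
conflicting = ([1.0, 0.0], -1.0)     # pulls the model the opposite way

aligned_score = harm_score(w, aligned, safety_example)
conflicting_score = harm_score(w, conflicting, safety_example)
print("aligned:", aligned_score, "conflicting:", conflicting_score)
```

In this toy setting the conflicting example gets a positive score and the aligned one a negative score, and taking an actual SGD step on the conflicting example does raise the safety loss. For a real LLM the gradients would come from backpropagation rather than a closed-form expression.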
Method
GradShield computes a Finetuning Implicit Harmfulness Score (FIHS) for each data point, approximating a leave-one-out harmfulness measure, and uses an adaptive thresholding algorithm to filter out high-FIHS data.
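The score-then-filter flow can be sketched as follows. This summary does not spell out the adaptive thresholding algorithm, so the sketch swaps in a simple robust outlier rule (median plus `k` times the median absolute deviation) as a stand-in. The function names, the parameter `k`, and the example scores are all hypothetical.

```python
from statistics import median


def adaptive_threshold(scores, k=3.0):
    """Stand-in rule, not the paper's algorithm: the cutoff adapts to the
    score distribution as median + k * median absolute deviation."""
    center = median(scores)
    mad = median(abs(s - center) for s in scores)
    return center + k * mad


def filter_dataset(examples, scores, k=3.0):
    """Split examples into kept and removed using the data-dependent cutoff."""
    threshold = adaptive_threshold(scores, k)
    kept = [ex for ex, s in zip(examples, scores) if s <= threshold]
    removed = [ex for ex, s in zip(examples, scores) if s > threshold]
    return kept, removed, threshold


# Made-up harmfulness scores: most sit near zero, two stand out.
examples = [f"e{i}" for i in range(8)]
scores = [0.01, -0.02, 0.00, 0.03, 0.90, -0.01, 0.02, 0.85]

kept, removed, threshold = filter_dataset(examples, scores)
print("removed:", removed)
```

The point of a data-dependent cutoff is that it needs no prior knowledge of how much harmful data a submitted dataset contains. The rule the authors actually use may differ substantially from this stand-in.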
In practice
- Use FIHS to identify implicitly harmful finetuning data.
- Apply adaptive thresholding for dynamic data filtering.
- Evaluate safety with ASR and GPT-rated harmfulness scores.
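The two safety metrics in the last item reduce to simple aggregates once each model response has been judged. The sketch below assumes a per-response boolean verdict (`True` when the response to a harmful prompt is judged harmful) and a numeric judge rating. How the verdicts and ratings are produced, and the rating scale, are assumptions here and are not taken from the paper.

```python
def attack_success_rate(verdicts):
    """Fraction of harmful prompts whose response was judged harmful."""
    if not verdicts:
        raise ValueError("need at least one judged response")
    return sum(1 for v in verdicts if v) / len(verdicts)


def mean_harmfulness(ratings):
    """Average judge rating across responses (scale is an assumption)."""
    if not ratings:
        raise ValueError("need at least one rating")
    return sum(ratings) / len(ratings)


# Hypothetical judgments for four harmful prompts.
verdicts = [True, False, False, False]
ratings = [5, 1, 1, 1]  # assumed 1-to-5 scale, higher means more harmful

asr = attack_success_rate(verdicts)
avg_rating = mean_harmfulness(ratings)
print(f"ASR = {asr:.0%}, mean harmfulness = {avg_rating}")
```

Reporting both numbers is useful because ASR counts how often an attack succeeds, while the mean rating reflects how severe the responses are.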
Topics
- GradShield
- LLM Safety Alignment
- Finetuning Implicit Harmfulness Score
- Adaptive Thresholding
- Finetuning Attacks
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.