GradShield: Alignment Preserving Finetuning

Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, extended

Summary

GradShield is a data-filtering method that keeps Large Language Models (LLMs) from becoming misaligned during finetuning, covering both explicitly and implicitly harmful data. It computes a Finetuning Implicit Harmfulness Score (FIHS) for each data point, which quantifies that point's potential to degrade safety alignment, and then applies an adaptive thresholding algorithm to remove high-FIHS points from the finetuning dataset. In the reported experiments, GradShield keeps the Attack Success Rate (ASR) below 6% across the tested utility finetuning tasks and harmful-data ratios, outperforming baseline methods while preserving utility performance. Computing FIHS costs approximately one epoch of finetuning, and the method generalizes across LLMs including Llama-3.1-8B-Instruct, Llama-2-7B-chat, and Qwen2.5-7B-Instruct.

Key takeaway

For AI engineers and model providers offering finetuning services, GradShield provides a robust defense against safety misalignment. You should integrate GradShield into your finetuning pipelines to proactively filter harmful data, ensuring models remain aligned and safe for deployment without compromising their task-specific utility. This is especially critical when dealing with diverse, user-provided datasets where implicit harmfulness is a concern.

Key insights

GradShield filters finetuning data using a gradient-based score to preserve LLM safety alignment without sacrificing utility.

Method

GradShield computes a Finetuning Implicit Harmfulness Score (FIHS) for each data point, approximating a leave-one-out harmfulness measure, and uses an adaptive thresholding algorithm to filter out high-FIHS data.
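The summary does not give the FIHS formula or the thresholding rule, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes that per-sample gradients and the gradient of a loss on a safety reference set are already available as flat vectors, uses a first-order estimate of how one training step on a sample would change that safety loss as a stand-in for a leave-one-out score, and substitutes a median-plus-MAD cutoff for the adaptive threshold. The names `implicit_harm_scores`, `adaptive_threshold`, `filter_dataset`, and the parameters `lr` and `k` are all hypothetical.

```python
from statistics import median


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def implicit_harm_scores(sample_grads, safety_grad, lr=1.0):
    """First-order stand-in for a leave-one-out harmfulness score (assumption).

    One SGD step on sample i moves the weights by -lr * g_i, so the loss on a
    safety reference set changes by roughly -lr * <safety_grad, g_i>.
    A positive score means training on the sample raises the safety loss.
    """
    return [-lr * dot(safety_grad, g) for g in sample_grads]


def adaptive_threshold(scores, k=3.0):
    """Hypothetical cutoff: median + k * MAD (not the paper's algorithm)."""
    med = median(scores)
    mad = median([abs(s - med) for s in scores])
    return med + k * mad


def filter_dataset(samples, scores, k=3.0):
    """Split samples into kept and removed using the data-dependent cutoff."""
    tau = adaptive_threshold(scores, k)
    kept = [s for s, sc in zip(samples, scores) if sc <= tau]
    removed = [s for s, sc in zip(samples, scores) if sc > tau]
    return kept, removed, tau


# Toy data: four benign-looking samples and one ("e") whose gradient
# points against the safety gradient.
samples = ["a", "b", "c", "d", "e"]
safety_grad = [1.0, 0.0]
sample_grads = [[0.0, 1.0], [0.1, 0.9], [0.2, -1.0], [-0.1, 1.1], [-2.0, 0.5]]

scores = implicit_harm_scores(sample_grads, safety_grad)
kept, removed, tau = filter_dataset(samples, scores)
print(kept, removed)
```

In a real pipeline the gradients would come from the model being finetuned, which is consistent with the stated cost of roughly one finetuning epoch for scoring. The cutoff here adapts to the score distribution of each dataset, so no fixed harmful-data ratio needs to be known in advance; the paper's actual thresholding algorithm may differ.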

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.