Enhancing Safety of Large Language Models via Embedding Space Separation
Summary
Embedding Space Separation (ES2) is a fine-tuning approach that hardens Large Language Models (LLMs) against adversarial attacks by explicitly increasing the distance between harmful and safe latent representations in the embedding space. The method exploits the linear separability of harmful and harmless queries and is evaluated on four open-source LLMs: Llama-2-7B-Chat-hf, Llama-3-8B-Instruct, Mistral-7B-Instruct, and Qwen-2.5-7B-Instruct. A Kullback-Leibler (KL) divergence regularization term maintains the model's general capabilities on harmless inputs, preventing degradation. In experiments, ES2 substantially improves Defense Success Rate (DSR) against embedding-level attacks (RepE, Soft Prompt, SCAV) and prompt-level attacks (AutoDAN, GCG), often by 15%-30% over baselines, while preserving comparable performance on benchmarks such as MMLU-Pro and GPQA. Because an attack must perturb an embedding much further to succeed, the perturbation itself causes semantic collapse, and the model produces incoherent or gibberish output.
Key takeaway
For research scientists developing or deploying LLMs, understanding and mitigating adversarial attacks is crucial. ES2 offers a robust defense by reshaping the embedding space, making it significantly harder for both embedding-level and prompt-level attacks to succeed. Consider applying ES2's fine-tuning approach, particularly to open-source models, to establish a wider safety margin: responses to benign inputs stay coherent, and outputs stay harmless even under aggressive adversarial pressure.
Key insights
Explicitly separating harmful and harmless embeddings in LLMs enhances safety without sacrificing general capabilities.
Principles
- Harmful and harmless embeddings are linearly separable.
- Large embedding perturbations destroy semantic integrity.
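The two principles above can be illustrated with a toy sketch. It is not the paper's code: the 2-D "embeddings" are invented, and the difference-of-means probe is an assumed stand-in for whatever linear classifier one might fit on real hidden states. It shows that a linear boundary separates the two classes and that the distance from a harmful point to that boundary is the minimum perturbation an embedding-level attack would need.

```python
import math

def mean_vec(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def diff_of_means_probe(harmful, harmless):
    """Return (w, b) such that w.x + b > 0 is read as 'harmful'.

    The boundary is the hyperplane through the midpoint of the two class
    means, orthogonal to their difference (an illustrative choice).
    """
    mu_h, mu_s = mean_vec(harmful), mean_vec(harmless)
    w = [a - c for a, c in zip(mu_h, mu_s)]
    mid = [(a + c) / 2 for a, c in zip(mu_h, mu_s)]
    b = -sum(wi * mi for wi, mi in zip(w, mid))
    return w, b

def score(w, b, x):
    """Signed (unnormalized) score of x under the linear probe."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def distance_to_boundary(w, b, x):
    """Euclidean distance from x to the hyperplane w.x + b = 0."""
    return abs(score(w, b, x)) / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical toy embeddings, invented for illustration only.
harmful = [[2.0, 2.0], [3.0, 2.0], [2.0, 3.0], [3.0, 3.0]]
harmless = [[-2.0, -2.0], [-3.0, -2.0], [-2.0, -3.0], [-3.0, -3.0]]

w, b = diff_of_means_probe(harmful, harmless)
margin_before = distance_to_boundary(w, b, harmful[0])

# Pushing the harmful cluster further away (as a separation objective would)
# increases the perturbation needed to cross the boundary.
pushed = [[2 * v for v in x] for x in harmful]
w2, b2 = diff_of_means_probe(pushed, harmless)
margin_after = distance_to_boundary(w2, b2, pushed[0])
```

In this toy setup the margin grows once the harmful cluster is moved away. On a real model the same quantities would be computed on hidden states taken from the chosen layers.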
Method
ES2 fine-tunes LLMs by maximizing Euclidean distance between harmful and harmless embeddings at critical layers, regularized by KL divergence to preserve general capabilities. Training terminates if KL divergence exceeds a threshold.
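A minimal sketch of how such an objective could be assembled follows. The summary does not give the exact formulation, so several details are assumptions: the mean pairwise distance as the separation term, the KL direction (reference model to tuned model), the `kl_weight` coefficient, and the function names `es2_style_loss` and `should_stop`, which are all hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - c) ** 2 for a, c in zip(u, v)))

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def es2_style_loss(harmful_emb, harmless_emb, ref_probs, tuned_probs,
                   kl_weight=1.0):
    """Illustrative loss: reward separation, penalize drift on harmless inputs.

    harmful_emb / harmless_emb: lists of embedding vectors at one layer.
    ref_probs / tuned_probs: next-token distributions of the original and
    fine-tuned model on a harmless input (assumed KL direction: ref || tuned).
    """
    dists = [euclidean(h, s) for h in harmful_emb for s in harmless_emb]
    separation = sum(dists) / len(dists)
    kl = kl_divergence(ref_probs, tuned_probs)
    # Minimizing this loss maximizes separation while keeping KL small.
    loss = -separation + kl_weight * kl
    return loss, separation, kl

def should_stop(kl, kl_threshold):
    """Early-termination rule: stop once KL drift exceeds the threshold."""
    return kl > kl_threshold

# Toy values, invented for illustration.
loss, separation, kl = es2_style_loss(
    harmful_emb=[[3.0, 4.0]],
    harmless_emb=[[0.0, 0.0]],
    ref_probs=[0.5, 0.5],
    tuned_probs=[0.5, 0.5],
)
```

In an actual training loop the embeddings would be hidden states taken from the selected layers, the loss would be backpropagated with an autodiff framework, and `should_stop` would be checked after each step.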
In practice
- Apply ES2 to open-source LLMs for robust jailbreak defense.
- Target specific semantic emergence and terminal layers for fine-tuning.
- Use KL divergence to prevent general capability degradation.
Topics
- LLM Safety
- Embedding Space
- Adversarial Attacks
- Fine-tuning
- KL Regularization
Code references
- vinid/safety-tuned-llamas
- cassidylaidlaw/hidden-context
- andyzoujm/representation-engineering
- schwinnl/llm_embedding_attack
- SproutNan/AI-Safety_SCAV
Best for: Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.