Enhancing Safety of Large Language Models via Embedding Space Separation

2026-03-24 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, extended

Summary

Embedding Space Separation (ES2) is a fine-tuning approach that hardens Large Language Models (LLMs) against adversarial attacks by explicitly increasing the distance between harmful and safe latent representations in the embedding space. The method builds on the observation that harmful and harmless queries are linearly separable, and it is evaluated on open-source LLMs including Llama-2-7B-Chat-hf, Llama-3-8B-Instruct, Mistral-7B-Instruct, and Qwen-2.5-7B-Instruct. To prevent degradation, ES2 adds a Kullback-Leibler (KL) divergence regularization term that preserves the model's general capabilities on harmless inputs. Experiments show that ES2 improves Defense Success Rate (DSR) against embedding-level attacks (RepE, Soft Prompt, SCAV) and prompt-level attacks (AutoDAN, GCG), often by 15%-30% over baselines, while keeping performance on benchmarks such as MMLU-Pro and GPQA comparable. Because an attack must now apply a larger perturbation to move a harmful representation into the safe region, the perturbed representation loses semantic coherence, and the model produces incoherent or gibberish output.

Key takeaway

For research scientists developing or deploying LLMs, understanding and mitigating adversarial attacks is crucial. ES2 offers a robust defense by re-engineering the embedding space, making it significantly harder for both embedding-level and prompt-level attacks to succeed. You should consider implementing ES2's fine-tuning approach, particularly for open-source models, to establish a wider safety margin and ensure outputs remain coherent and harmless, even under aggressive adversarial pressure.

Key insights

Explicitly separating harmful and harmless embeddings in LLMs enhances safety without sacrificing general capabilities.

Principles

Harmful and harmless embeddings are linearly separable.
Large embedding perturbations destroy semantic integrity.
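
The two principles can be illustrated on toy data. The sketch below is not the paper's procedure: it assumes a simple difference-of-means linear probe and hand-made 2-D points standing in for real hidden states, and every function name is hypothetical. It shows that the smallest shift needed to reach a linear boundary is |w·x + b| / ||w||, so pushing the two clusters further apart raises the perturbation an attack has to apply.

```python
import math

def centroid(points):
    """Mean vector of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def linear_probe(harmful, harmless):
    """Difference-of-means probe (an assumption, not the paper's classifier).
    w points from the harmless centroid to the harmful centroid, and the
    boundary passes through the midpoint between the two centroids."""
    mu_h, mu_s = centroid(harmful), centroid(harmless)
    w = [a - b for a, b in zip(mu_h, mu_s)]
    mid = [(a + b) / 2 for a, b in zip(mu_h, mu_s)]
    b = -sum(wi * mi for wi, mi in zip(w, mid))
    return w, b

def score(w, b, x):
    """Signed, unnormalized side of the boundary: positive means 'harmful'."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def min_perturbation(w, b, x):
    """Smallest L2 norm of a shift that moves x onto the boundary."""
    return abs(score(w, b, x)) / math.sqrt(sum(wi * wi for wi in w))

# Toy 2-D stand-ins for embeddings (hypothetical data).
harmful = [[2.0, 2.0], [3.0, 2.0], [2.0, 3.0], [3.0, 3.0]]
harmless = [[-2.0, -2.0], [-3.0, -2.0], [-2.0, -3.0], [-3.0, -3.0]]

w, b = linear_probe(harmful, harmless)
correct = sum(score(w, b, x) > 0 for x in harmful) + sum(score(w, b, x) < 0 for x in harmless)
accuracy = correct / (len(harmful) + len(harmless))

# Scale both clusters away from the origin to mimic a wider separation.
far_harmful = [[2 * v for v in x] for x in harmful]
far_harmless = [[2 * v for v in x] for x in harmless]
w_far, b_far = linear_probe(far_harmful, far_harmless)

near = min_perturbation(w, b, harmful[0])
far = min_perturbation(w_far, b_far, far_harmful[0])
print(accuracy, near, far)
```

On this toy data the probe separates the classes perfectly, and doubling the separation doubles the minimum perturbation for the same point.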

Method

ES2 fine-tunes LLMs by maximizing Euclidean distance between harmful and harmless embeddings at critical layers, regularized by KL divergence to preserve general capabilities. Training terminates if KL divergence exceeds a threshold.
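
A minimal sketch of such an objective follows, under stated assumptions. The summary does not give the exact loss, so the choices here are illustrative only: distance is measured between class centroids at a single layer, the KL term compares a frozen reference model's output distribution with the tuned model's on a harmless input, and the weight `lam` and all function names are hypothetical.

```python
import math

def mean_vector(vectors):
    """Centroid of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def euclidean(a, b):
    """L2 distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def es2_style_loss(harmful_emb, harmless_emb, ref_logits, tuned_logits, lam=1.0):
    """Loss to minimize: negative centroid distance plus a weighted KL penalty.
    Assumed form for illustration; a real setup would sum over the chosen layers."""
    separation = euclidean(mean_vector(harmful_emb), mean_vector(harmless_emb))
    kl = kl_divergence(softmax(ref_logits), softmax(tuned_logits))
    return -separation + lam * kl, separation, kl

# Toy values (hypothetical): centroids at (2, 0) and (-2, 0).
harmful_emb = [[1.0, 0.0], [3.0, 0.0]]
harmless_emb = [[-1.0, 0.0], [-3.0, 0.0]]
ref_logits = [1.0, 2.0, 3.0]

# Tuned model still matches the reference on harmless input: no KL penalty.
loss, separation, kl = es2_style_loss(harmful_emb, harmless_emb, ref_logits, [1.0, 2.0, 3.0])

# Tuned model has drifted on harmless input: the KL term raises the loss.
drift_loss, _, drift_kl = es2_style_loss(harmful_emb, harmless_emb, ref_logits, [3.0, 2.0, 1.0])
print(loss, kl, drift_loss, drift_kl)
```

Minimizing this quantity rewards a larger gap between the two centroids, while the KL term penalizes any change in behavior on harmless inputs.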

In practice

Apply ES2 to open-source LLMs for robust jailbreak defense.
Fine-tune at the critical layers: those where semantics emerge and the terminal layers.
Use KL divergence to prevent general capability degradation.
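
The KL-based stopping rule mentioned under Method can be sketched as a guard around the training loop. This is an illustrative assumption about how such a guard might be wired, not the paper's code: `run_until_kl_limit`, `make_fake_step`, and the threshold value are hypothetical, and the fake step stands in for a real fine-tuning update that reports the current KL divergence on harmless inputs.

```python
def run_until_kl_limit(step_fn, max_steps, kl_threshold):
    """Call step_fn() up to max_steps times and stop as soon as the KL it
    reports exceeds kl_threshold. Returns (steps_run, stopped_early)."""
    for step in range(max_steps):
        kl = step_fn()
        if kl > kl_threshold:
            return step + 1, True
    return max_steps, False

def make_fake_step(kl_increment):
    """Stand-in for a fine-tuning update: KL grows by a fixed amount per call."""
    state = {"kl": 0.0}

    def step():
        state["kl"] += kl_increment
        return state["kl"]

    return step

# KL climbs 0.25 per step, so the fifth step (KL = 1.25) trips the guard.
steps, stopped = run_until_kl_limit(make_fake_step(0.25), max_steps=100, kl_threshold=1.0)

# With a loose threshold the loop simply runs out its step budget.
steps_loose, stopped_loose = run_until_kl_limit(make_fake_step(0.25), max_steps=3, kl_threshold=100.0)
print(steps, stopped, steps_loose, stopped_loose)
```

In a real run, the step function would perform one optimizer update and evaluate KL on a held-out set of harmless prompts. Whether to keep or discard the weights from the step that tripped the guard is a design choice the summary does not specify.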

Topics

LLM Safety
Embedding Space
Adversarial Attacks
Fine-tuning
KL Regularization

Best for: Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.