DOG-DPO: Dynamic Optimization in Geometry for Safety Alignment
Summary
DOG-DPO is a data selection framework that improves safety alignment for large language models by choosing which preference pairs to train on. The selection step itself requires no training and no teacher model. Current methods often rely on large, redundant datasets and score each preference pair independently, which discards the directional information the pair carries. DOG-DPO instead treats preference pairs as structured geometric signals, representing each pair as a direction in the model's representation space. It decomposes the preference geometry of multiple datasets into a global anchor subspace and dataset-specific residual subspaces, then selects a subset that maximizes diversity-based coverage of these alignment directions before DPO training. Using only 11% of the preference pairs, DOG-DPO achieves a strong utility-robustness trade-off across six safety benchmarks and two model backbones. It recovers most of the safety gains of full-data training and runs significantly faster than existing selection baselines.
Key takeaway
For machine learning engineers working on safety alignment for large language models, DOG-DPO offers a way to reduce data dependency. It recovers most of the safety gains of full-data training using only 11% of the preference pairs, which should cut DPO training time and compute accordingly. Because the selection step is training-free and teacher-free, it adds little overhead to an alignment pipeline while preserving a strong utility-robustness trade-off. Consider integrating geometric data selection to streamline your DPO workflows.
Key insights
DOG-DPO geometrically analyzes preference pairs to select diverse, non-redundant subsets, significantly improving LLM safety alignment efficiency.
Principles
- Preference pairs are structured geometric signals.
- Decompose multi-dataset geometry into global and residual subspaces.
- Maximize diversity-based coverage of alignment directions.
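The first two principles can be illustrated with a minimal sketch. The summary does not say how embeddings are extracted or how the anchor subspace is estimated, so everything here is an assumption: the toy 3-dimensional embeddings are made up, each direction is taken as the normalized difference between chosen and rejected embeddings, and a single mean direction stands in for the global anchor subspace. All function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)

def preference_direction(chosen_emb, rejected_emb):
    # Assumption: a pair's "direction" is the unit vector from the
    # rejected response's embedding to the chosen response's embedding.
    return normalize([c - r for c, r in zip(chosen_emb, rejected_emb)])

def global_anchor(directions):
    # Assumption: a rank-one anchor (the normalized mean direction)
    # stands in for the global anchor subspace shared by all datasets.
    dim = len(directions[0])
    mean = [sum(d[i] for d in directions) / len(directions) for i in range(dim)]
    return normalize(mean)

def residual(direction, anchor):
    # Remove the anchor component; what remains is dataset-specific.
    coeff = dot(direction, anchor)
    return [x - coeff * a for x, a in zip(direction, anchor)]

# Toy (chosen, rejected) embeddings for two hypothetical datasets.
datasets = {
    "dataset_a": [([1.0, 0.2, 0.0], [0.0, 0.0, 0.0]),
                  ([1.0, 0.4, 0.0], [0.0, 0.1, 0.0])],
    "dataset_b": [([1.0, 0.0, 0.3], [0.0, 0.0, 0.0]),
                  ([0.9, 0.0, 0.5], [0.0, 0.0, 0.1])],
}

directions = {
    name: [preference_direction(c, r) for c, r in pairs]
    for name, pairs in datasets.items()
}
all_directions = [d for ds in directions.values() for d in ds]
anchor = global_anchor(all_directions)
residuals = {
    name: [residual(d, anchor) for d in ds]
    for name, ds in directions.items()
}
```

In this toy setup the anchor captures what both datasets agree on (the first coordinate), while the residuals separate the datasets along the second and third coordinates. A real pipeline would likely use a higher-rank subspace estimated from model hidden states.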
Method
DOG-DPO represents preference pairs as directions in model representation space, decomposes multi-dataset geometry into global and residual subspaces, then selects subsets by maximizing diversity-based coverage before DPO training.
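The selection step can be sketched as a greedy procedure. The summary does not specify the coverage objective, so this example swaps in greedy farthest-point selection over cosine similarity as one plausible way to favor diverse, non-redundant directions. The function names, the toy 2-dimensional directions, and the budget are all hypothetical.

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den > 0 else 0.0

def select_diverse(directions, budget):
    """Greedy farthest-point selection (an assumed stand-in for the
    paper's coverage objective): start from the first direction, then
    repeatedly add the one least similar to everything selected so far."""
    if budget <= 0 or not directions:
        return []
    budget = min(budget, len(directions))
    selected = [0]
    # max_sim[i]: highest cosine similarity between direction i
    # and any direction already selected.
    max_sim = [cosine(d, directions[0]) for d in directions]
    while len(selected) < budget:
        candidates = [i for i in range(len(directions)) if i not in selected]
        nxt = min(candidates, key=lambda i: max_sim[i])
        selected.append(nxt)
        for i, d in enumerate(directions):
            max_sim[i] = max(max_sim[i], cosine(d, directions[nxt]))
    return selected

# Toy preference directions: three near-duplicate clusters of two.
toy_directions = [
    [1.0, 0.0], [0.99, 0.05],
    [0.0, 1.0], [0.05, 0.99],
    [-1.0, 0.0], [-0.99, 0.05],
]

# On a real pool, the 11% figure from the summary would correspond to
# budget = round(0.11 * len(pool)); a budget of 3 is used here for the toy data.
selected = select_diverse(toy_directions, budget=3)
```

With these inputs the procedure picks one direction from each cluster and skips the near-duplicates, which is the behavior the coverage criterion is meant to encourage. The selected indices would then determine which preference pairs are passed to DPO training.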
In practice
- Reduce DPO preference data by 89% by training on the selected 11% of pairs.
- Apply geometric data selection for LLM safety.
- Improve utility-robustness trade-off.
Topics
- Large Language Models
- Safety Alignment
- Data Selection
- Preference Data
- Geometric Machine Learning
- DPO Training
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.