ALIGNBEAM: Inference-Time Alignment Transfer via Cross-Vocabulary Logit Mixing
Summary
ALIGNBEAM is a training-free, inference-time method for restoring safety in large language models (LLMs) whose alignment has degraded through domain fine-tuning, particularly specialists that come from a different model family than the safe model. Existing inference-time defenses require the two models to share a vocabulary; ALIGNBEAM removes that requirement. At each decoding step it translates the logits of a safe anchor model into the target model's vocabulary, token by token. A small LLM judge then selects the safest of K candidate continuations. Because no weights change, the safety-utility trade-off can be tuned at deployment without retraining. Evaluations show that ALIGNBEAM significantly increases refusal rates on adversarial benchmarks for both cross-vocabulary and same-vocabulary model pairs, while keeping task accuracy and inference overhead at practical levels.
Key takeaway
For AI security engineers and ML teams deploying fine-tuned LLMs, ALIGNBEAM offers a way to mitigate safety degradation without retraining. It transfers safety alignment from a robust anchor model to a specialist at inference time through logit mixing, even when the two models use different vocabularies. The safety-utility trade-off can then be tuned after deployment, improving refusal rates on harmful prompts while maintaining task accuracy.
Key insights
ALIGNBEAM enables training-free, inference-time safety transfer between LLMs, even across different vocabularies, using logit mixing and an LLM judge.
Principles
- Domain fine-tuning degrades LLM safety.
- Safety alignment can transfer cross-family at inference.
Method
At each decoding step, ALIGNBEAM translates the anchor model's logits into the target model's vocabulary, token by token. A small LLM judge then selects the safest of K candidate continuations.
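The decoding step described above can be illustrated with a minimal sketch. It is not the authors' implementation: this summary does not say how logits are translated or mixed, so the sketch assumes translation by exact token-string match, a hypothetical mixing weight `alpha`, single-token candidates, and a stand-in `toy_judge` function in place of a real LLM judge. All function names are invented for illustration.

```python
import math
from typing import Callable, Dict, List

NEG_INF = float("-inf")


def translate_logits(anchor_logits: Dict[str, float], target_vocab: List[str]) -> List[float]:
    """Map anchor logits onto the target vocabulary by exact token-string match.

    Assumption: target tokens the anchor does not know receive -inf.
    """
    return [anchor_logits.get(tok, NEG_INF) for tok in target_vocab]


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax that leaves -inf entries untouched."""
    finite = [x for x in logits if x != NEG_INF]
    if not finite:
        return list(logits)
    peak = max(finite)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in finite))
    return [x - log_z if x != NEG_INF else NEG_INF for x in logits]


def mix_logits(target_logits: List[float], translated_anchor: List[float], alpha: float) -> List[float]:
    """Blend target and anchor log-probabilities with weight alpha (hypothetical knob).

    Assumption: where the anchor has no matching token, keep the target's score.
    """
    t = log_softmax(target_logits)
    a = log_softmax(translated_anchor)
    return [ti if ai == NEG_INF else (1.0 - alpha) * ti + alpha * ai for ti, ai in zip(t, a)]


def top_k_tokens(scores: List[float], vocab: List[str], k: int) -> List[str]:
    """Return the k highest-scoring tokens as candidate continuations."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [vocab[i] for i in ranked[:k]]


def pick_safest(candidates: List[str], judge: Callable[[str], float]) -> str:
    """Select the candidate the judge scores as safest (higher means safer)."""
    return max(candidates, key=judge)


def toy_judge(text: str) -> float:
    """Stand-in for the small LLM judge; a real judge would call a model."""
    return 1.0 if text.startswith("Sorry") else 0.0


# Toy vocabularies and logits, invented for illustration only.
target_vocab = ["Sure", "Sorry", "Here", "I"]
target_logits = [3.0, 0.5, 2.0, 1.0]
anchor_logits = {"Sorry": 3.0, "I": 2.0, "Sure": -1.0, "cannot": 1.0}

translated = translate_logits(anchor_logits, target_vocab)
mixed = mix_logits(target_logits, translated, alpha=0.7)
candidates = top_k_tokens(mixed, target_vocab, k=2)
chosen = pick_safest(candidates, toy_judge)
```

In this toy setup the fine-tuned target alone would prefer a compliant opening, while mixing in the anchor's translated logits moves a refusal token into the candidate set for the judge to pick. Real tokenizers rarely align by exact string match, so a practical translation step would need to handle differing token boundaries.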
In practice
- Apply safety to cross-family LLM specialists.
- Tune safety-utility trade-off at deployment.
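Because the method changes no weights, the deployment-time tuning mentioned above could in principle be exposed as plain configuration. The sketch below is one hypothetical way to do that: the class `AlignBeamConfig`, its fields `alpha` (mixing weight) and `num_candidates` (the K passed to the judge), the default values, and the environment variable names are all assumptions made for illustration, not settings documented by the paper.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class AlignBeamConfig:
    """Hypothetical deployment knobs; names and defaults are assumptions."""

    alpha: float = 0.5        # weight on the anchor model (0 = target only)
    num_candidates: int = 4   # K candidate continuations shown to the judge

    def __post_init__(self) -> None:
        if not 0.0 <= self.alpha <= 1.0:
            raise ValueError("alpha must lie in [0, 1]")
        if self.num_candidates < 1:
            raise ValueError("num_candidates must be at least 1")


def config_from_env(env: Mapping[str, str] = os.environ) -> AlignBeamConfig:
    """Read the knobs from environment variables so no retraining or redeploy of weights is needed."""
    return AlignBeamConfig(
        alpha=float(env.get("ALIGNBEAM_ALPHA", "0.5")),
        num_candidates=int(env.get("ALIGNBEAM_K", "4")),
    )


# Example: a stricter setting supplied at deployment time.
strict = config_from_env({"ALIGNBEAM_ALPHA": "0.8", "ALIGNBEAM_K": "6"})
```

Under these assumptions, raising `alpha` would lean harder on the anchor's safety behavior at some cost to specialist accuracy, and lowering it would do the reverse.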
Topics
- ALIGNBEAM
- Large Language Models
- Safety Alignment
- Inference-Time Defenses
- Logit Mixing
- Cross-Vocabulary Models
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.