ALIGNBEAM : Inference-Time Alignment Transfer via Cross-Vocabulary Logit Mixing

· Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, quick

Summary

ALIGNBEAM is a training-free, inference-time defense against the safety degradation that domain fine-tuning causes in large language models (LLMs), aimed in particular at cross-family specialists. Existing inference-time techniques require the two models to share a vocabulary; ALIGNBEAM removes that requirement. At each decoding step it translates the logits of a safe anchor model into the target model's vocabulary, token by token, and a small LLM judge then selects the safest of K candidate continuations. Because no weights change, the safety-utility trade-off can be tuned at deployment without retraining. In the reported evaluations, ALIGNBEAM significantly increases refusal rates on adversarial benchmarks for both cross-vocabulary and same-vocabulary model pairs, while keeping task accuracy and inference overhead at practical levels.

Key takeaway

For AI security engineers and ML teams deploying fine-tuned LLMs, ALIGNBEAM offers a way to mitigate safety degradation without retraining. It transfers safety alignment from a robust anchor model to a specialist, even when the two models use different vocabularies. Because the logit mixing happens at inference time, the safety-utility trade-off can be tuned at deployment, improving refusal rates on harmful prompts while maintaining task accuracy.

Key insights

ALIGNBEAM enables training-free, inference-time safety transfer between LLMs, even across different vocabularies, using logit mixing and an LLM judge.

Method

At each decoding step, ALIGNBEAM translates the anchor model's logits into the target model's vocabulary, token by token, so that they can be mixed with the target's own logits; a small LLM judge then selects the safest of K candidate continuations.
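
The decoding step described above can be illustrated with a toy sketch. This is not the authors' implementation: the summary does not specify how logits are translated, how they are mixed, or how the judge scores candidates. The sketch assumes translation by exact token-string match with a floor score for unmatched tokens, mixing as a convex combination with a hypothetical weight `alpha`, single next tokens standing in for the K candidate continuations, and a plain Python function standing in for the small LLM judge. All function and parameter names are illustrative.

```python
from typing import Callable, Dict, List


def translate_logits(anchor_logits: Dict[str, float],
                     target_vocab: List[str],
                     floor: float = -10.0) -> Dict[str, float]:
    """Map anchor scores onto the target vocabulary.

    ASSUMPTION: tokens are matched by exact string; target tokens the
    anchor does not have receive a low floor score.
    """
    return {tok: anchor_logits.get(tok, floor) for tok in target_vocab}


def mix_logits(target_logits: Dict[str, float],
               anchor_in_target: Dict[str, float],
               alpha: float) -> Dict[str, float]:
    """Convex combination of target and translated anchor logits.

    ASSUMPTION: alpha = 0 keeps the specialist unchanged, alpha = 1
    follows the anchor only. This is the deployment-time knob.
    """
    return {tok: (1.0 - alpha) * target_logits[tok] + alpha * anchor_in_target[tok]
            for tok in target_logits}


def top_k(logits: Dict[str, float], k: int) -> List[str]:
    """Return the k highest-scoring tokens, best first."""
    return sorted(logits, key=logits.get, reverse=True)[:k]


def decode_step(target_logits: Dict[str, float],
                anchor_logits: Dict[str, float],
                judge: Callable[[str], float],
                alpha: float = 0.5,
                k: int = 2) -> str:
    """One toy step: translate, mix, propose K candidates, let the judge pick."""
    translated = translate_logits(anchor_logits, list(target_logits))
    mixed = mix_logits(target_logits, translated, alpha)
    candidates = top_k(mixed, k)
    # The judge returns a safety score; the highest-scoring candidate wins.
    return max(candidates, key=judge)


def toy_judge(candidate: str) -> float:
    """Stand-in for the small LLM judge: penalise a compliant opener."""
    return 0.0 if candidate == "Sure" else 1.0


# Toy vocabularies that only partly overlap ("The" vs "I").
TARGET_LOGITS = {"Sure": 3.0, "Sorry": 1.0, "The": 0.5}
ANCHOR_LOGITS = {"Sorry": 4.0, "Sure": 0.0, "I": 1.0}

if __name__ == "__main__":
    for alpha in (0.0, 0.5):
        print(alpha, decode_step(TARGET_LOGITS, ANCHOR_LOGITS, toy_judge, alpha=alpha, k=1))
```

With `k=1` the judge has no choice to make, so the output isolates the effect of the mixing weight: at `alpha=0.0` the specialist's preferred token is kept, and at `alpha=0.5` the anchor's preference takes over. Raising `k` gives the judge more candidates to choose from at the cost of extra judge calls per step.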

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.