ALIGNBEAM: Inference-Time Alignment Transfer via Cross-Vocabulary Logit Mixing
Summary
ALIGNBEAM is a training-free method that addresses the safety degradation domain fine-tuning causes in large language models, particularly in cross-family specialist models. Existing inference-time defenses mix logits from a safe anchor model into the target model's output, but they require the two models to share a vocabulary. ALIGNBEAM removes that requirement by translating the anchor's logits into the target model's vocabulary token by token at each decoding step. A small LLM judge then selects the safest of K candidate continuations. Safety alignment can therefore be transferred between different model families at inference time without modifying either model's weights, and the safety-utility trade-off can be tuned at deployment without retraining. On adversarial benchmarks the method is reported to raise refusal rates substantially while keeping task accuracy and inference overhead practical.
Key takeaway
For AI Security Engineers deploying fine-tuned large language models, especially cross-family specialists, ALIGNBEAM offers a way to mitigate the safety degradation that fine-tuning introduces. It transfers safety alignment between different model families at inference time without altering model weights, so the safety-utility trade-off can be tuned after deployment. The reported effect is a substantially higher refusal rate on harmful prompts with practical task performance retained. Evaluate ALIGNBEAM if you need to harden LLM deployments against adversarial inputs.
Key insights
ALIGNBEAM enables training-free, cross-vocabulary safety alignment transfer between LLM families at inference time using logit mixing and an LLM judge.
Principles
- Safety alignment transfers cross-family.
- Inference-time defenses avoid weight changes.
- Cross-vocabulary logit mixing is feasible.
Method
ALIGNBEAM translates the anchor model's logits into the target model's vocabulary token by token at each decoding step, which makes logit mixing possible even when the two models use different vocabularies. A small LLM judge then selects the safest of K candidate continuations.
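The two steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not say how logits are translated or mixed, so the prefix-overlap mapping, the log-linear mixing rule, the weight name `alpha`, the probability floor, and the keyword-based `toy_judge` (a stand-in for the small LLM judge) are all hypothetical.

```python
import math
from typing import Callable, Dict, List

FLOOR = 1e-9  # assumed floor for target tokens that match no anchor token


def softmax(logits: Dict[str, float]) -> Dict[str, float]:
    """Numerically stable softmax over a token -> logit mapping."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}


def translate_anchor(anchor_logits: Dict[str, float],
                     target_vocab: List[str]) -> Dict[str, float]:
    """Project anchor next-token probabilities onto the target vocabulary.

    Assumption: a target token receives the probability mass of every anchor
    token that is a string prefix of it, or that it is a prefix of.
    Assumes no empty-string tokens in either vocabulary.
    """
    anchor_probs = softmax(anchor_logits)
    translated = {}
    for t in target_vocab:
        mass = sum(p for a, p in anchor_probs.items()
                   if a.startswith(t) or t.startswith(a))
        translated[t] = math.log(max(mass, FLOOR))
    return translated


def mix_logits(target_logits: Dict[str, float],
               anchor_in_target: Dict[str, float],
               alpha: float) -> Dict[str, float]:
    """Log-linear mix: alpha=0 keeps the target model, alpha=1 follows the anchor."""
    target_probs = softmax(target_logits)
    mixed = {t: (1 - alpha) * math.log(target_probs[t]) + alpha * anchor_in_target[t]
             for t in target_logits}
    return softmax(mixed)


def top_k(probs: Dict[str, float], k: int) -> List[str]:
    """Return the k most probable tokens, best first."""
    return sorted(probs, key=probs.get, reverse=True)[:k]


def pick_safest(candidates: List[str], judge: Callable[[str], float]) -> str:
    """Return the candidate continuation with the highest judge score."""
    return max(candidates, key=judge)


def toy_judge(text: str) -> float:
    """Hypothetical stand-in for the LLM judge: rewards an explicit refusal."""
    return 1.0 if "can't" in text else 0.0


# Toy next-token logits over two different vocabularies (illustrative values only).
target_logits = {"Sure": 3.0, "I": 1.0, "Sorry": 0.5}   # fine-tuned target model
anchor_logits = {"I'm": 2.0, "Sor": 2.0, "Su": -2.0}    # safe anchor model

translated = translate_anchor(anchor_logits, list(target_logits))
unmixed_top = top_k(mix_logits(target_logits, translated, alpha=0.0), 1)
mixed_top = top_k(mix_logits(target_logits, translated, alpha=0.8), 2)
chosen = pick_safest(["Sure, here is how.", "I can't help with that."], toy_judge)
```

With `alpha=0.0` the target model's preferred token is unchanged; with `alpha=0.8` the anchor's refusal-leaning tokens move to the top of the distribution. In a full decoder the K candidates passed to the judge would be continuations generated from the mixed distribution rather than the hard-coded strings used here.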
In practice
- Deploy safety alignment without retraining.
- Tune safety-utility trade-off post-deployment.
- Apply logit mixing to cross-family LLMs.
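Tuning the trade-off after deployment could look like the sweep below. It is a sketch under assumptions: the summary does not name the tuning knob, so the mixing weight `alpha`, the `evaluate` callback, and the accuracy floor are hypothetical, and `synthetic_eval` returns made-up values purely to exercise the function. It does not reproduce any measured result.

```python
from typing import Callable, Iterable, Optional, Tuple


def choose_alpha(alphas: Iterable[float],
                 evaluate: Callable[[float], Tuple[float, float]],
                 min_accuracy: float) -> Optional[float]:
    """Pick the alpha with the highest refusal rate whose accuracy stays acceptable.

    `evaluate(alpha)` is assumed to return (refusal_rate, task_accuracy),
    both in [0, 1], measured on your own benchmarks.
    Returns None when no alpha meets the accuracy floor.
    """
    best_alpha = None
    best_refusal = -1.0
    for alpha in alphas:
        refusal, accuracy = evaluate(alpha)
        if accuracy >= min_accuracy and refusal > best_refusal:
            best_alpha, best_refusal = alpha, refusal
    return best_alpha


def synthetic_eval(alpha: float) -> Tuple[float, float]:
    """Synthetic stand-in, NOT measured data: refusal rises and accuracy falls with alpha."""
    return alpha, 1.0 - 0.5 * alpha


grid = [0.0, 0.25, 0.5, 0.75, 1.0]
chosen_alpha = choose_alpha(grid, synthetic_eval, min_accuracy=0.7)
```

Because neither model's weights change, a sweep like this only requires re-running inference at each setting, which is what makes post-deployment tuning practical.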
Topics
- ALIGNBEAM
- LLM Safety
- Inference-Time Alignment
- Logit Mixing
- Cross-Vocabulary Models
- Adversarial Robustness
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.