BioRefusalAudit: Auditing Biosecurity Refusal Depth Using General and Domain-Fine-Tuned Sparse Autoencoders
Summary
BioRefusalAudit introduces a method for evaluating whether biosecurity refusals in language models are structurally sound, complementing traditional assessments of hazardous output. Across five architectures, no model consistently distinguished benign from hazardous content. Gemma 2 2B-IT never genuinely refused. Gemma 4 E2B-IT refused 65 of 75 prompts with chat-template formatting but 0 of 75 without it, and its refusal rate also fell to 0% under an 80-token output cap. Qwen 2.5 1.5B and Phi-3-mini over-refused, flagging 83-87% of benign biology as hazardous. Llama 3.2 1B was the only model to show a meaningful tier gradient. Models often refused requests about Schedule I but non-toxic substances (such as psilocybin cultivation) more often than genuinely hazardous biology, which suggests that refusal tracks legality and cultural salience rather than actual CBRN hazard. To probe internal mechanisms, the study introduces a divergence score D that compares a model's surface response with its internal sparse autoencoder (SAE) feature activations. Preliminary evidence, developed on consumer hardware, suggests that activation-level auditing can reveal failure modes invisible to behavioral evaluation, and that these failure modes vary significantly by architecture.
Key takeaway
If you are an AI Security Engineer evaluating LLM biosecurity, treat current refusal mechanisms as unreliable: minor prompt-formatting changes or output-length caps can circumvent them. Models often conflate legality or cultural salience with actual CBRN hazard. To build robust safeguards, move beyond surface-level behavioral evaluation and add activation-level auditing, so that you can understand and strengthen refusal mechanisms against subtle manipulation.
Key insights
Language model biosecurity refusals are often structurally unsound: they are swayed by superficial factors and internal biases, so assessing them requires activation-level auditing.
Principles
- LLM refusal depth varies significantly across architectures.
- Refusals track legality/salience more than actual hazard.
- Behavioral evaluation alone misses internal refusal failure modes.
Method
BioRefusalAudit introduces a divergence score D, comparing a model's surface refusal label to its internal sparse autoencoder (SAE) feature activations to audit refusal depth.
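This summary does not state how D is computed, so the following is only a minimal sketch of one possible reading, not the paper's formula. It assumes the surface label is binary (refused or not) and that some subset of SAE features has already been tagged as refusal-related. The function names, the feature IDs, and the "activation mass" heuristic are all hypothetical.

```python
def refusal_mass(activations, refusal_features):
    """Fraction of total SAE activation mass on refusal-tagged features.

    activations: dict mapping feature id -> non-negative activation value.
    refusal_features: set of feature ids assumed (hypothetically) to encode refusal.
    """
    total = sum(activations.values())
    if total == 0:
        return 0.0
    on_refusal = sum(v for k, v in activations.items() if k in refusal_features)
    return on_refusal / total


def divergence_score(surface_refused, activations, refusal_features):
    """Hypothetical D: gap between the surface label and the internal refusal signal.

    Returns a value in [0, 1]; 0 means surface and internals agree,
    1 means they fully disagree. This is an illustrative assumption,
    not the definition used in BioRefusalAudit.
    """
    surface = 1.0 if surface_refused else 0.0
    return abs(surface - refusal_mass(activations, refusal_features))


# Toy example with made-up feature ids and activation values.
toy_activations = {101: 3.0, 202: 1.0}
toy_refusal_features = {101}
d_when_refused = divergence_score(True, toy_activations, toy_refusal_features)
d_when_complied = divergence_score(False, toy_activations, toy_refusal_features)
```

Under this toy definition, a model that complies on the surface while most of its activation mass sits on refusal-tagged features gets a high score, which is the kind of surface/internal mismatch the method is described as targeting.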
In practice
- Test LLM refusals with varied prompt framing and output length.
- Audit internal activations for deeper refusal mechanism insights.
- Consider domain-fine-tuned SAEs for biosecurity evaluations.
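The first practice above can be sketched as a small grid sweep over prompt framing and output length. This is an illustrative harness under stated assumptions, not the study's evaluation code: the keyword-based `is_refusal` check, the framing names, and `stub_generate` (a toy stand-in for a real model call) are all hypothetical.

```python
from itertools import product

# Crude keyword heuristic; a real audit would need a more careful refusal classifier.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to help")


def is_refusal(text):
    """Return True if the response contains any refusal marker."""
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def refusal_rates(generate, prompts, framings, token_caps):
    """Compute the refusal rate for every (framing, token cap) condition.

    generate: callable (prompt, framing, max_tokens) -> response string.
    Returns a dict mapping (framing, cap) -> fraction of prompts refused.
    """
    rates = {}
    for framing, cap in product(framings, token_caps):
        refused = sum(is_refusal(generate(p, framing, cap)) for p in prompts)
        rates[(framing, cap)] = refused / len(prompts)
    return rates


def stub_generate(prompt, framing, max_tokens):
    """Toy model: refuses only with a chat template and a generous token budget."""
    if framing == "chat_template" and max_tokens >= 200:
        return "I can't help with that."
    return "Sure, here is"


rates = refusal_rates(
    stub_generate,
    prompts=["prompt a", "prompt b"],
    framings=["chat_template", "raw"],
    token_caps=[80, 256],
)
```

Comparing rates across conditions, rather than reporting a single number, is what exposes the kind of formatting and length sensitivity described in the summary.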
Topics
- Biosecurity
- Language Model Evaluation
- Sparse Autoencoders
- Refusal Mechanisms
- AI Safety
- Gemma
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.