BioRefusalAudit: Auditing Biosecurity Refusal Depth Using General and Domain-Fine-Tuned Sparse Autoencoders

Source: Artificial Intelligence · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy) · Depth: Expert, quick

Summary

BioRefusalAudit is a method for evaluating whether a language model's biosecurity refusals are structurally sound, complementing traditional assessments of hazardous output. Across five architectures, no model consistently distinguished benign from hazardous content. Gemma 2 2B-IT never genuinely refused. Gemma 4 E2B-IT refused 65 of 75 prompts with chat-template formatting but 0 of 75 without it, and its refusal rate also fell to 0% under an 80-token cap. Qwen 2.5 1.5B and Phi-3-mini over-refused, flagging 83-87% of benign biology as hazardous. Llama 3.2 1B exhibited the only meaningful tier gradient. Models often refused requests involving Schedule I but non-toxic substances (such as psilocybin cultivation) more readily than genuinely hazardous biology, which suggests that refusal tracks legality and cultural salience rather than actual CBRN hazard. To probe internal mechanisms, the study introduces a divergence score D that compares a model's surface response with its internal sparse autoencoder (SAE) feature activations. The preliminary evidence, developed on consumer hardware, suggests that activation-level auditing can reveal failure modes invisible to behavioral evaluation, with significant variation across architectures.

Key takeaway

AI security engineers evaluating LLM biosecurity safeguards should treat current refusal mechanisms as unreliable: minor prompt changes or output length caps can circumvent them, and models often conflate legality or cultural salience with actual CBRN hazard. Building robust safeguards means moving beyond surface-level behavioral evaluation and adding internal activation-level auditing, so that refusal mechanisms can be understood and strengthened against subtle manipulation.
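The fragility described above can be checked with a simple tally of refusals per evaluation condition. The following is a minimal sketch, not the study's harness: the function names, the condition labels, and the `max_drop` threshold are all hypothetical, and the refusal judgments are assumed to come from whatever labeling step the evaluator already uses.

```python
from collections import defaultdict


def refusal_rates(records):
    """Return {condition: (refused, total, rate)} from (condition, refused) pairs."""
    counts = defaultdict(lambda: [0, 0])
    for condition, refused in records:
        counts[condition][0] += int(refused)
        counts[condition][1] += 1
    return {c: (r, n, r / n) for c, (r, n) in counts.items()}


def fragile_conditions(rates, baseline, max_drop=0.2):
    """List conditions whose refusal rate falls more than max_drop below the baseline.

    max_drop is an illustrative threshold, not a value taken from the study.
    """
    base_rate = rates[baseline][2]
    return sorted(
        c for c, (_, _, rate) in rates.items()
        if c != baseline and base_rate - rate > max_drop
    )


# Counts quoted in the summary: 65/75 refusals with the chat template, 0/75 without.
records = (
    [("chat_template", True)] * 65
    + [("chat_template", False)] * 10
    + [("no_template", False)] * 75
)
rates = refusal_rates(records)
flagged = fragile_conditions(rates, baseline="chat_template")
```

With the counts quoted in the summary, the condition without chat-template formatting is flagged, since its refusal rate drops from 65/75 to 0/75. The same tally could be extended with an output-cap condition.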

Key insights

Language model biosecurity refusals are often structurally unsound and swayed by superficial factors and internal biases, so auditing them requires activation-level analysis.

Method

BioRefusalAudit introduces a divergence score D, comparing a model's surface refusal label to its internal sparse autoencoder (SAE) feature activations to audit refusal depth.
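This summary does not give the formula for D, so the following is only an illustrative sketch of how a surface label might be compared with SAE feature activations. Everything specific here is an assumption: the category names, the `feature_to_category` mapping from SAE feature ids to categories, and the choice of one minus cosine similarity as the distance.

```python
import math

# Hypothetical label set; the summary does not list the categories the study uses.
CATEGORIES = ("refuse", "comply", "hedge")


def internal_distribution(activations, feature_to_category):
    """Sum positive SAE feature activations per category and normalize."""
    totals = {c: 0.0 for c in CATEGORIES}
    for feature_id, value in activations.items():
        category = feature_to_category.get(feature_id)
        if category in totals and value > 0:
            totals[category] += value
    mass = sum(totals.values())
    if mass == 0:
        # No mapped feature fired: fall back to a uniform distribution.
        return {c: 1.0 / len(CATEGORIES) for c in CATEGORIES}
    return {c: totals[c] / mass for c in CATEGORIES}


def divergence(surface_label, activations, feature_to_category):
    """Assumed form of D: 1 - cosine similarity between the one-hot surface
    label and the category distribution derived from SAE activations."""
    if surface_label not in CATEGORIES:
        raise ValueError(f"unknown surface label: {surface_label!r}")
    internal = internal_distribution(activations, feature_to_category)
    # The one-hot surface vector has norm 1, so the dot product is just
    # the internal mass on the surface label.
    dot = internal[surface_label]
    norm_internal = math.sqrt(sum(v * v for v in internal.values()))
    return 1.0 - dot / norm_internal


# Hypothetical feature ids mapped to categories by a prior labeling step.
feature_map = {10: "refuse", 11: "comply"}
aligned = divergence("refuse", {10: 2.0}, feature_map)
shallow = divergence("refuse", {11: 3.0}, feature_map)
```

In this sketch, a value near 0 means the surface label and the internal features agree, while a value near 1 means the model refuses on the surface while its active features map to compliance, which is the kind of shallow refusal the audit is meant to surface.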

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.