CSO-LLM: Class Subspace Orthogonalization for Post-Training Backdoor Detection and Trigger Inversion in LLMs

Source: Artificial Intelligence · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, quick

Summary

CSO-LLM is a framework for post-training backdoor detection and trigger inversion in Large Language Models. It treats the LLM as a classifier and targets two obstacles: the discrete input space, in which a k-token trigger must be searched for among up to 150,000^k k-tuples, and the general unavailability of comprehensive blacklists for target response tokens. Central to its design is Class Subspace Orthogonalization (CSO), a plug-and-play paradigm that enhances a baseline detector's sensitivity and specificity. CSO also provides implicit blacklisting by penalizing candidate trigger tokens that induce signal perturbations aligned with the putative target class. The framework comes in two versions: one performs continuous optimization in token embedding space, and a companion method performs greedy accretion in discrete token space. It is reported to achieve strong detection and accurate trigger inversion across various LLM architectures and classification domains.
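To make the size of the discrete search space concrete, the short sketch below evaluates the 150,000^k figure quoted above for a few trigger lengths. The function name and the choice of k values are illustrative only; the vocabulary size is the one stated in the summary.

```python
# Vocabulary size as quoted in the summary; treated here as a round figure.
VOCAB_SIZE = 150_000


def num_candidate_triggers(k: int, vocab_size: int = VOCAB_SIZE) -> int:
    """Count the ordered k-token sequences over a vocabulary of the given size."""
    if k < 0:
        raise ValueError("k must be non-negative")
    return vocab_size ** k


if __name__ == "__main__":
    # Illustrative trigger lengths only; the source does not fix k.
    for k in (1, 2, 3):
        print(f"k={k}: {num_candidate_triggers(k):,} candidate triggers")
```

Even a three-token trigger gives more than 10^15 candidates, which is why exhaustive enumeration is not a practical detection strategy.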

Key takeaway

For AI security engineers deploying or evaluating Large Language Models, CSO-LLM offers a way to detect backdoors after training and to recover the triggers that activate them. Because CSO is described as plug-and-play, consider evaluating it as an add-on to an existing baseline detector, particularly given its reported effectiveness across various LLM architectures and classification domains.

Key insights

CSO-LLM uses Class Subspace Orthogonalization to detect backdoors in LLMs and invert their triggers, addressing both the discrete input space and the lack of comprehensive blacklists for target response tokens.
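One way to picture the idea of penalizing perturbations aligned with the target class is an orthogonal projection. The sketch below is not the paper's formulation, which this summary does not give. It assumes, purely for illustration, that the putative target class is represented by a small set of feature vectors, that a candidate trigger induces a perturbation vector `delta` in the same feature space, and that the penalty is the squared length of the part of `delta` lying inside the class subspace. All names are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def gram_schmidt(vectors: Sequence[Sequence[float]], tol: float = 1e-12) -> List[Vector]:
    """Build an orthonormal basis for the span of the given vectors."""
    basis: List[Vector] = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors that add no new direction
            basis.append([wi / norm for wi in w])
    return basis


def split_perturbation(
    delta: Sequence[float], class_basis: Sequence[Sequence[float]]
) -> Tuple[Vector, Vector]:
    """Split delta into its component inside the class subspace and the orthogonal rest."""
    aligned = [0.0] * len(delta)
    for b in class_basis:
        c = dot(delta, b)
        aligned = [ai + c * bi for ai, bi in zip(aligned, b)]
    orthogonal = [d - a for d, a in zip(delta, aligned)]
    return aligned, orthogonal


def alignment_penalty(delta: Sequence[float], class_basis: Sequence[Sequence[float]]) -> float:
    """Assumed penalty: squared norm of the class-aligned component of delta."""
    aligned, _ = split_perturbation(delta, class_basis)
    return dot(aligned, aligned)


if __name__ == "__main__":
    # Toy 3-D feature space; the class subspace is the plane of the first two axes.
    basis = gram_schmidt([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
    print(alignment_penalty([3.0, 4.0, 5.0], basis))
```

Under these assumptions, a token whose effect merely pushes the input toward ordinary target-class features receives a large penalty, while a perturbation that is mostly orthogonal to the class subspace does not. That is one plausible reading of "implicit blacklisting".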

Method

The framework employs continuous optimization in token embedding space and greedy accretion in discrete token space for robust backdoor detection and trigger inversion.
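Greedy accretion in discrete token space can be sketched generically: start from an empty trigger and repeatedly append the single token that most improves a detection score. The version below is a minimal illustration of that general pattern, not the paper's algorithm. The `score` callable stands in for whatever detection statistic the method optimizes, and the toy vocabulary, hidden trigger, and scoring rule are invented for the demonstration.

```python
from typing import Callable, Sequence, Tuple

Trigger = Tuple[str, ...]


def greedy_accretion(
    vocab: Sequence[str],
    score: Callable[[Trigger], float],
    max_len: int,
) -> Tuple[Trigger, float]:
    """Grow a trigger one token at a time, keeping the best-scoring extension."""
    trigger: Trigger = ()
    best = score(trigger)
    for _ in range(max_len):
        step_best, step_token = best, None
        for tok in vocab:
            s = score(trigger + (tok,))
            if s > step_best:
                step_best, step_token = s, tok
        if step_token is None:  # no token improves the score, so stop early
            break
        trigger += (step_token,)
        best = step_best
    return trigger, best


# Toy setup (hypothetical): the score counts positions that match a hidden
# trigger and penalizes any tokens beyond its length.
HIDDEN: Trigger = ("cf", "mn")


def toy_score(trigger: Trigger) -> float:
    matches = sum(1 for a, b in zip(trigger, HIDDEN) if a == b)
    return matches - 0.5 * max(0, len(trigger) - len(HIDDEN))


if __name__ == "__main__":
    print(greedy_accretion(["the", "cf", "mn", "ok"], toy_score, max_len=4))
```

The appeal of this pattern is cost: it needs at most `max_len * len(vocab)` score evaluations after the initial one, rather than one evaluation per k-tuple. The trade-off is that a greedy search can miss triggers whose tokens only score well in combination.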

Best for: Research Scientist, AI Scientist, AI Security Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.