ANVIL: Anomaly-based Vulnerability Identification without Labelled Training Data
Summary
ANVIL is an anomaly-based software vulnerability detector that uses pre-trained Large Language Models (LLMs) and requires no labeled training data. It reframes vulnerability detection as an anomaly detection problem, building on the observation that LLMs trained for code generation show a significant accuracy gap when reconstructing vulnerable versus non-vulnerable code. ANVIL works at line-level granularity and was evaluated on the Magma benchmark and a leakage-free 2024 CVEFixes dataset. It significantly outperforms the leading supervised detectors LineVul and LineVD, achieving $1.62\times$ to $2.18\times$ higher Top-5 accuracies and $1.02\times$ to $1.29\times$ higher ROC scores, despite using no labeled vulnerability data for training. Experiments showed that larger LLMs, such as CodeLlama-13B, and adaptive Maximum Compound Statement (MCS) contexts improve detection. ANVIL generalizes to unseen vulnerabilities and performs robustly across diverse datasets and LLM architectures, including CodeLlama, CodeQwen, and StarCoderBase.
Key takeaway
For AI Security Engineers building automated vulnerability detection, ANVIL demonstrates a label-free alternative to supervised detectors. Consider integrating anomaly-based LLM techniques that use a pre-trained model's code generation ability to flag lines that deviate from what the model expects. This approach reduces reliance on scarce, expensive labeled vulnerability datasets and, in the reported evaluation, outperforms supervised methods at both classifying and prioritizing unseen vulnerabilities.
Key insights
Code-generation LLMs reconstruct vulnerable code less accurately than non-vulnerable code, and that gap exposes vulnerabilities as anomalies without any labeled training data.
Principles
- Vulnerable code is an anomaly in LLM-predicted distributions.
- Larger LLMs discriminate better between vulnerable and non-vulnerable code.
- Adaptive context (MCS) enhances vulnerability detection.
Method
ANVIL masks code lines, uses an LLM for reconstruction, then calculates a hybrid anomaly score from reconstruction loss and exact match to identify vulnerabilities.
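The scoring step can be sketched as follows. This is a minimal illustration, not ANVIL's actual formula: the summary does not say how reconstruction loss and exact match are combined, so the additive `mismatch_penalty`, the `Reconstruction` record, and all function names here are assumptions. The LLM call is left out; the caller is assumed to supply the predicted line and the log-probabilities the model assigned to the original tokens.

```python
from dataclasses import dataclass


@dataclass
class Reconstruction:
    """Model output for one masked line (hypothetical record, filled by the caller)."""
    line_no: int
    original: str                # the line that was masked
    predicted: str               # the line the LLM generated in its place
    token_logprobs: list[float]  # log-probabilities of the ORIGINAL line's tokens


def reconstruction_loss(token_logprobs: list[float]) -> float:
    """Mean negative log-likelihood of the original line's tokens."""
    if not token_logprobs:
        return 0.0
    return -sum(token_logprobs) / len(token_logprobs)


def anomaly_score(rec: Reconstruction, mismatch_penalty: float = 1.0) -> float:
    """Assumed hybrid: loss plus a fixed penalty when the prediction
    is not an exact match (ignoring surrounding whitespace)."""
    exact = rec.predicted.strip() == rec.original.strip()
    return reconstruction_loss(rec.token_logprobs) + (0.0 if exact else mismatch_penalty)


def top_k_lines(recs: list[Reconstruction], k: int = 5) -> list[int]:
    """Line numbers of the k most anomalous lines, highest score first."""
    ranked = sorted(recs, key=anomaly_score, reverse=True)
    return [r.line_no for r in ranked[:k]]
```

Ranking lines by score rather than thresholding them matches the Top-5 style of evaluation: a reviewer inspects the few highest-scoring lines first.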
In practice
- Apply fill-in-the-middle (FIM) tasks for code anomaly detection.
- Prefer MCS for context over fixed line counts.
- Combine loss and exact match for anomaly scoring.
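As a rough sketch of the masking step behind these recommendations, the function below builds a FIM prompt for one target line. The context span is passed in by the caller, so it can come from an MCS finder rather than a fixed line count; no MCS finder is implemented here. The function name and the `<PRE>`, `<SUF>`, `<MID>` sentinels are placeholders: each model family defines its own FIM tokens, so substitute the ones documented for the model in use.

```python
def build_fim_prompt(
    lines: list[str],
    target: int,
    ctx_start: int,
    ctx_end: int,
    pre: str = "<PRE>",   # placeholder sentinel, model-specific in practice
    suf: str = "<SUF>",   # placeholder sentinel, model-specific in practice
    mid: str = "<MID>",   # placeholder sentinel, model-specific in practice
) -> tuple[str, str]:
    """Mask lines[target] and return (prompt, masked_line).

    lines[ctx_start:ctx_end] is the context window, e.g. the span of the
    enclosing compound statement. Indices are 0-based, ctx_end is exclusive.
    """
    if not (0 <= ctx_start <= target < ctx_end <= len(lines)):
        raise ValueError("target must lie inside the context window")

    before = lines[ctx_start:target]
    after = lines[target + 1:ctx_end]

    # Prefix ends with a newline so the model starts on a fresh line.
    prefix = "".join(line + "\n" for line in before)
    # Suffix starts with the newline that terminates the masked line.
    suffix = "".join("\n" + line for line in after)

    return f"{pre}{prefix}{suf}{suffix}{mid}", lines[target]


snippet = [
    "void f(char *s) {",
    "  char buf[8];",
    "  strcpy(buf, s);",
    "}",
]
prompt, masked = build_fim_prompt(snippet, target=2, ctx_start=0, ctx_end=4)
```

The model's completion for `prompt` would then be compared against `masked`, and the loss and exact-match signals combined into a per-line anomaly score.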
Topics
- Software Vulnerability Detection
- Anomaly Detection
- Large Language Models
- Code Generation
- Fill-in-the-Middle
- CodeLlama
Best for: Research Scientist, AI Scientist, AI Security Engineer, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.