ANVIL: Anomaly-based Vulnerability Identification without Labelled Training Data

2026-06-30 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Software Development & Engineering · Depth: Advanced, extended

Summary

ANVIL is an anomaly-based software vulnerability detector that uses pre-trained Large Language Models (LLMs) and requires no labeled training data. It reframes vulnerability detection as anomaly detection, building on the observation that LLMs trained for code generation reconstruct vulnerable code markedly less accurately than non-vulnerable code. ANVIL works at line-level granularity and was evaluated on the Magma benchmark and a leakage-free 2024 CVEFixes dataset. Despite using no labeled vulnerability data, it outperforms the leading supervised detectors LineVul and LineVD, with 1.62× to 2.18× better Top-5 accuracies and 1.02× to 1.29× better ROC scores. Experiments showed that larger LLMs, such as CodeLlama-13B, and adaptive Maximum Compound Statement (MCS) contexts improve detection. The approach generalizes to unseen vulnerabilities and performs robustly across datasets and LLM architectures, including CodeLlama, CodeQwen, and StarCoderBase.

Key takeaway

For AI Security Engineers building automated vulnerability detection, ANVIL demonstrates a label-free alternative to supervised training. Consider exploring anomaly-based LLM techniques that use a pre-trained model's code generation ability to flag the lines it fails to reconstruct. This reduces reliance on scarce, expensive labeled vulnerability datasets and, in the reported evaluation, outperformed supervised methods at both classifying and prioritizing unseen vulnerabilities.

Key insights

An LLM's difficulty in accurately reconstructing vulnerable code exposes that code as an anomaly, with no labeled training data required.

Principles

Vulnerable code is an anomaly in LLM-predicted distributions.
Larger LLMs improve anomaly detection discrimination.
Adaptive context (MCS) enhances vulnerability detection.

Method

ANVIL masks code lines, uses an LLM for reconstruction, then calculates a hybrid anomaly score from reconstruction loss and exact match to identify vulnerabilities.
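
The mask, reconstruct, and score loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not ANVIL's implementation: the `Reconstructor` callback, the `mismatch_penalty` weight, and the additive way loss and exact match are combined are all hypothetical, since the summary does not give the scoring formula.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface (an assumption, not ANVIL's API): given the code
# before the masked line, the original line, and the code after it, return
# the model's predicted line and the log-probabilities the model assigns to
# each token of the ORIGINAL line.
Reconstructor = Callable[[str, str, str], Tuple[str, Sequence[float]]]


def hybrid_score(original: str, predicted: str,
                 token_logprobs: Sequence[float],
                 mismatch_penalty: float = 1.0) -> float:
    """Combine reconstruction loss with an exact-match signal.

    Assumed formula: mean negative log-likelihood of the original line,
    plus a fixed penalty when the prediction differs from it.
    """
    if token_logprobs:
        loss = -sum(token_logprobs) / len(token_logprobs)
    else:
        loss = 0.0
    exact = predicted.strip() == original.strip()
    return loss + (0.0 if exact else mismatch_penalty)


def rank_lines(lines: List[str],
               reconstruct: Reconstructor) -> List[Tuple[int, float]]:
    """Mask each non-blank line in turn and rank lines by anomaly score.

    Returns (1-based line number, score) pairs, most anomalous first.
    """
    scored = []
    for i, line in enumerate(lines):
        if not line.strip():
            continue  # nothing to reconstruct on a blank line
        prefix = "\n".join(lines[:i])
        suffix = "\n".join(lines[i + 1:])
        predicted, logprobs = reconstruct(prefix, line, suffix)
        scored.append((i + 1, hybrid_score(line, predicted, logprobs)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

In use, `reconstruct` would wrap a call to a code LLM. Taking the first five entries of the ranked list gives the kind of prioritized output that a Top-5 metric evaluates.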

In practice

Apply fill-in-the-middle (FIM) tasks for code anomaly detection.
Prefer MCS for context over fixed line counts.
Combine loss and exact match for anomaly scoring.
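
To make the first two points concrete, here is a hedged sketch of building a fill-in-the-middle prompt for one masked line, with the context window passed in as explicit bounds. The function name `build_fim_prompt` is hypothetical. The default sentinel tokens follow the StarCoder-style convention as an assumption; other model families use different sentinels, so check the model card. The bounds `ctx_start` and `ctx_end` stand in for an MCS that would in practice be computed with a parser such as tree-sitter, which is not shown here.

```python
from typing import List, Tuple


def build_fim_prompt(lines: List[str], target: int,
                     ctx_start: int, ctx_end: int,
                     prefix_tok: str = "<fim_prefix>",
                     suffix_tok: str = "<fim_suffix>",
                     middle_tok: str = "<fim_middle>") -> Tuple[str, str]:
    """Build a FIM prompt that masks lines[target].

    All indices are 0-based and ctx_end is exclusive. The context bounds
    are supplied by the caller (for example, the enclosing compound
    statement) rather than a fixed number of lines around the target.
    Returns (prompt, expected_line).
    """
    if not (0 <= ctx_start <= target < ctx_end <= len(lines)):
        raise ValueError("target must lie inside [ctx_start, ctx_end)")
    # Context before the masked line, each line newline-terminated.
    prefix = "".join(line + "\n" for line in lines[ctx_start:target])
    # Context after the masked line, starting with the newline that ends it.
    suffix = "".join("\n" + line for line in lines[target + 1:ctx_end])
    prompt = f"{prefix_tok}{prefix}{suffix_tok}{suffix}{middle_tok}"
    return prompt, lines[target]
```

The returned `expected_line` is what the model's completion would be compared against for the exact-match signal, and what its reconstruction loss would be computed on.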

Topics

Software Vulnerability Detection
Anomaly Detection
Large Language Models
Code Generation
Fill-in-the-Middle
CodeLlama

Code references

tree-sitter/tree-sitter

Best for: Research Scientist, AI Scientist, AI Security Engineer, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.