Snyk VulnBench JS 1.0: Can LLMs Find the Same Bugs Twice?

· Source: Blog RSS Feed | Snyk · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Software Development & Engineering · Depth: Intermediate, long

Summary

The Snyk VulnBench JS 1.0 benchmark evaluated the repeatability of agentic LLM security reviews on JavaScript code, running 300 vulnerability-finding scans across several Claude configurations and Snyk Code SAST. The core finding is that LLM security findings are unevenly repeatable: reference-matched findings were stable, with 85% appearing in all five runs, while extra, unmatched reports varied widely, with nearly 50% appearing in only one of five runs. The highest-recall LLM configuration found 81% of Snyk Code reference vulnerabilities and achieved a 75.4% Snyk-reference F1, leaving a 24.6-point gap against deterministic SAST. LLMs were strong on familiar exploit shapes such as command injection but weaker on systematic SAST classes, which suggests that LLMs and SAST are complementary. More expensive LLM configurations did not necessarily perform better: Claude Opus 4.7 Max cost 5.7x more than Claude Opus 4.6 Medium while scoring lower.
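The recall and F1 figures in the summary can be related by simple arithmetic. The sketch below solves the F1 definition for precision. It assumes that the 81% recall and 75.4% F1 describe the same configuration and were aggregated the same way, which the summary implies but does not state. The function name is illustrative, and the inputs are the rounded figures quoted above, so the result is approximate.

```python
def implied_precision(recall: float, f1: float) -> float:
    """Solve F1 = 2 * P * R / (P + R) for precision P."""
    denom = 2 * recall - f1
    if denom <= 0:
        # F1 can never reach 2 * recall for a valid precision in (0, 1]
        raise ValueError("F1 must be smaller than 2 * recall")
    return f1 * recall / denom


# Rounded figures from the summary; assumed to belong to one configuration.
recall = 0.81
f1 = 0.754

precision = implied_precision(recall, f1)
print(f"implied precision: {precision:.1%}")  # implied precision: 70.5%

# The 24.6-point gap is consistent with the SAST baseline scoring 100
# against its own reference set (an assumption, not stated in the summary).
print(f"gap to 100: {100 - 75.4:.1f} points")  # gap to 100: 24.6 points
```

Under these assumptions, roughly three in ten reports from the highest-recall configuration did not match a reference finding. An unmatched report is not necessarily a false positive, since the reference set is itself Snyk Code output.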

Key takeaway

For MLOps engineers integrating LLMs into security workflows: LLM vulnerability reports are not consistently repeatable, especially for findings that do not match the reference set. Combine LLM-based code review with traditional SAST tools to cover LLMs' blind spots in systematic vulnerability classes and to benefit from SAST's deterministic enumeration. Prioritize LLM configurations whose non-reference reports are more stable across runs, and evaluate cost-performance trade-offs critically, since more expensive models do not guarantee better security coverage.

Key insights

The repeatability of LLM vulnerability findings varies significantly, with reference-matched reports being stable but unmatched reports highly inconsistent.

Method

Snyk VulnBench JS 1.0 ran 300 scans: 10 JavaScript projects, each scanned by 6 configurations (Snyk Code SAST and 5 Claude configurations), with every project-configuration pair repeated five times (10 x 6 x 5 = 300). Agreement and variance were measured against a Snyk Code reference set.
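One way to picture the repeatability measurement is to count, for each distinct finding, how many of the five runs reported it, and then split the counts by whether the finding matches the reference set. The sketch below is not Snyk's implementation. It assumes every finding has already been normalized to a comparable key, here a hypothetical (file path, CWE id) pair, which is the hard part in practice because LLM reports are free-form. All names and sample data are made up for illustration.

```python
from collections import Counter

Finding = tuple[str, str]  # hypothetical key: (file path, CWE id)


def run_frequency(runs: list[set[Finding]]) -> Counter:
    """Map each distinct finding to the number of runs that reported it."""
    counts: Counter = Counter()
    for findings in runs:
        counts.update(findings)
    return counts


def frequency_shares(counts: dict[Finding, int], n_runs: int) -> dict[int, float]:
    """Share of distinct findings seen in exactly k runs, for k = 1..n_runs."""
    if not counts:
        return {k: 0.0 for k in range(1, n_runs + 1)}
    hist = Counter(counts.values())
    return {k: hist.get(k, 0) / len(counts) for k in range(1, n_runs + 1)}


def repeatability(runs: list[set[Finding]], reference: set[Finding]):
    """Return (matched, unmatched) frequency shares against a reference set."""
    counts = run_frequency(runs)
    matched = {f: n for f, n in counts.items() if f in reference}
    unmatched = {f: n for f, n in counts.items() if f not in reference}
    return frequency_shares(matched, len(runs)), frequency_shares(unmatched, len(runs))


# Made-up data: five runs of one configuration on one project.
a, b = ("a.js", "CWE-78"), ("b.js", "CWE-79")
c, d = ("c.js", "CWE-22"), ("d.js", "CWE-89")
runs = [{a, b, c}, {a, b}, {a, b, d}, {a, b}, {a, b}]
reference = {a, b, ("e.js", "CWE-94")}

matched, unmatched = repeatability(runs, reference)
print(matched[5])    # share of matched findings seen in all five runs
print(unmatched[1])  # share of unmatched findings seen in only one run
```

In this toy data both reference-matched findings appear in all five runs, while both unmatched findings appear once each, which mirrors the shape of the reported result without reproducing its numbers.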

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Security Engineer, MLOps Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Blog RSS Feed | Snyk.