Snyk VulnBench JS 1.0: Can LLMs Find the Same Bugs Twice?

2026-06-14 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Advanced, long

Summary

Snyk VulnBench JS 1.0, released June 11, 2026, quantifies how repeatable agentic large language model (LLM) security reviews are on JavaScript code. The benchmark ran 300 vulnerability-finding scans across 10 small JavaScript and Express applications, using six Claude model configurations and scoring them against Snyk Code's deterministic SAST reference set. LLM security reports proved unevenly repeatable: 134 of 158 unique reference-matched findings were stable across all five runs, but 80 of 161 unique unmatched findings appeared in only one of five identical repetitions. The highest-recall LLM configuration found 81% of the Snyk Code reference vulnerabilities, and the best-scoring configuration, Claude Opus 4.6 Medium, achieved a 75.4% Snyk-reference F1. LLMs excelled at familiar exploit shapes such as SQL injection but struggled with systematic SAST classes such as resource-limit and path traversal issues. Higher cost did not buy better results: Claude Opus 4.7 Max cost 5.7x more and scored 68.8% F1, compared with 75.4% for Opus 4.6 Medium. The study concludes that LLM review and deterministic SAST are complementary.
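The cost comparison in the summary can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes the 5.7x figure is relative to Opus 4.6 Medium, and "F1 points per unit of relative cost" is a ratio chosen here for illustration, not a metric reported by the benchmark.

```python
# Figures quoted in the summary. relative_cost treats Opus 4.6 Medium as the
# baseline (1.0); that baseline is an assumption made for this sketch.
configs = {
    "Claude Opus 4.6 Medium": {"f1": 75.4, "relative_cost": 1.0},
    "Claude Opus 4.7 Max": {"f1": 68.8, "relative_cost": 5.7},
}


def f1_per_cost(entry):
    """Illustrative ratio: F1 percentage points per unit of relative cost."""
    return entry["f1"] / entry["relative_cost"]


# Rank configurations from most to least F1 per unit of relative cost.
ranked = sorted(configs, key=lambda name: f1_per_cost(configs[name]), reverse=True)

for name in ranked:
    print(f"{name}: {f1_per_cost(configs[name]):.1f} F1 points per unit of relative cost")
```

Under these assumptions the cheaper configuration wins on both absolute F1 and F1 per unit of cost, which is the point the summary makes in prose.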

Key takeaway

For AI Security Engineers integrating LLMs into security review: LLM findings are not uniformly repeatable. LLMs find most reference-matched vulnerabilities consistently across runs, but their reports with no reference match are often one-offs. Combine agentic LLM review with deterministic SAST so that each covers the vulnerability classes the other misses. Prefer LLM configurations that show higher stability for reference-matched findings and produce fewer noisy, unmatched reports, which reduces triage overhead. Evaluate cost-performance carefully, because more expensive models do not guarantee better security coverage.

Key insights

LLM security findings are unevenly repeatable; combine with SAST for comprehensive coverage.

Principles

LLM "true positives" (reference-matched findings) are mostly stable across runs.
LLM "extra reports" (findings with no reference match) are often one-off and noisy.
LLMs and SAST exhibit complementary blind spots.
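A minimal sketch of how run-to-run stability could be measured, assuming each scan yields a set of finding keys. The (file, rule) keys and the sample data below are hypothetical; the benchmark's own matching logic is not described in this summary. The last two lines only restate the fractions quoted in the summary.

```python
from collections import Counter


def stability_counts(runs):
    """Map each finding key to the number of runs in which it appeared."""
    counts = Counter()
    for findings in runs:
        counts.update(set(findings))  # count a finding at most once per run
    return counts


def summarize(runs):
    """Split unique findings into fully stable ones and one-offs."""
    counts = stability_counts(runs)
    total_runs = len(runs)
    return {
        "unique": len(counts),
        "stable": sorted(k for k, n in counts.items() if n == total_runs),
        "one_off": sorted(k for k, n in counts.items() if n == 1),
    }


# Hypothetical findings from five identical repetitions of one scan.
runs = [
    {("app.js", "sqli"), ("routes.js", "xss")},
    {("app.js", "sqli")},
    {("app.js", "sqli"), ("util.js", "path-traversal")},
    {("app.js", "sqli")},
    {("app.js", "sqli"), ("routes.js", "xss")},
]
summary = summarize(runs)
print(summary)

# Fractions reported in the summary, for comparison.
print(f"reference-matched, stable in all five runs: {134 / 158:.1%}")  # 84.8%
print(f"unmatched, seen in only one run: {80 / 161:.1%}")              # 49.7%
```

In the sample data, the SQL injection finding is stable, the path traversal finding is a one-off, and the XSS finding falls in between because it appears in two of five runs.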

Method

The benchmark ran 300 repeated scans (10 JavaScript fixtures × 6 Claude configurations × 5 repetitions), using a direct audit prompt and Snyk Code as the deterministic reference.
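The scan count follows from the design: 10 fixtures, 6 configurations, and the five repetitions mentioned in the summary. A sketch of enumerating that matrix, with placeholder fixture and configuration labels because this summary does not list the real names:

```python
from itertools import product

# Placeholder labels: the actual fixture and configuration names are not
# given in this summary.
FIXTURES = [f"fixture-{i}" for i in range(1, 11)]  # 10 applications
CONFIGS = [f"config-{i}" for i in range(1, 7)]     # 6 model configurations
REPETITIONS = 5                                    # identical runs per pair

# One entry per (fixture, configuration, repetition) combination.
scan_plan = [
    {"fixture": fixture, "config": config, "run": run}
    for fixture, config, run in product(FIXTURES, CONFIGS, range(1, REPETITIONS + 1))
]

print(len(scan_plan))  # 300
```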

In practice

Integrate LLM review for high-signal exploit shapes.
Utilize SAST for systematic data-flow enumeration.
Evaluate LLM cost-performance for security tasks.
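One way to act on these points is to merge both sources into a single triage queue. The sketch below is an assumption-laden illustration, not the study's method: the finding keys, the sample data, and the priority rules (SAST-confirmed findings first, LLM-only findings ranked by how many runs reported them) are all hypothetical.

```python
def triage(sast_findings, llm_run_counts, total_runs=5):
    """Merge SAST and LLM findings into one list with a source and priority.

    sast_findings: set of finding keys from the deterministic scanner.
    llm_run_counts: dict mapping finding key -> number of LLM runs reporting it.
    The priority thresholds are illustrative assumptions.
    """
    merged = []
    for key in sorted(set(sast_findings) | set(llm_run_counts)):
        seen = llm_run_counts.get(key, 0)
        if key in sast_findings:
            source = "both" if seen else "sast-only"
            priority = "high"
        elif seen == total_runs:
            source, priority = "llm-only", "medium"  # stable but unconfirmed
        else:
            source, priority = "llm-only", "low"     # unstable, likely noise
        merged.append(
            {"finding": key, "source": source, "llm_runs": seen, "priority": priority}
        )
    return merged


# Hypothetical inputs for illustration.
sast = {("app.js", "sqli"), ("upload.js", "resource-limit")}
llm_counts = {("app.js", "sqli"): 5, ("auth.js", "idor"): 5, ("util.js", "xss"): 1}

queue = triage(sast, llm_counts)
for item in queue:
    print(item)
```

Keeping the run count on each item lets reviewers see why an LLM-only report was ranked low, instead of silently discarding it.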

Topics

LLM Security Review
SAST
Vulnerability Detection
LLM Repeatability
Claude Models
JavaScript Security

Best for: CTO, AI Engineer, AI Product Manager, AI Security Engineer, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.