Snyk VulnBench JS 1.0: Can LLMs Find the Same Bugs Twice?

2026-06-14 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Software Development & Engineering · Depth: Expert, quick

Summary

Snyk VulnBench JS 1.0 measured how repeatable agentic large language model (LLM) security reviews are on JavaScript code, using 300 repeated vulnerability-finding scans. Repeatability was uneven. Reference-matched findings were stable: 134 of 158 unique findings (about 85%) appeared in all five repetitions. Extra model reports varied heavily: of 161 unique unmatched findings, 80 (about 50%) appeared in only one of five identical repetitions, and only 22 (about 14%) appeared in all five. The benchmark also showed that the two approaches are complementary. Models consistently identified familiar, high-signal exploit shapes and even surfaced a potential Snyk Code product gap, while Snyk Code static application security testing (SAST) was deterministic and better at systematically enumerating repeated data-flow sinks. The results support combining agentic LLM review with deterministic SAST.
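The shares implied by those counts can be checked with a few lines of arithmetic. This is only a sketch: the variable names and the `share` helper are illustrative and not part of the benchmark; the counts are the ones quoted in the summary.

```python
# Counts quoted in the summary; variable names are illustrative.
matched_total, matched_all_five = 158, 134
unmatched_total, unmatched_once, unmatched_all_five = 161, 80, 22


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


matched_stable = share(matched_all_five, matched_total)        # matched findings seen in all five runs
unmatched_single = share(unmatched_once, unmatched_total)      # unmatched findings seen in one run only
unmatched_stable = share(unmatched_all_five, unmatched_total)  # unmatched findings seen in all five runs

print(f"matched, all five runs:   {matched_stable}%")
print(f"unmatched, one run only:  {unmatched_single}%")
print(f"unmatched, all five runs: {unmatched_stable}%")
```

Roughly 85% of reference-matched findings recurred in every repetition, against about 14% of unmatched findings, which is the gap the summary describes as uneven repeatability.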

Key takeaway

AI security engineers evaluating code analysis tools should pair agentic LLM review with deterministic SAST. LLMs contribute by spotting familiar, high-signal exploit shapes, while SAST systematically enumerates data-flow sinks. Running both offsets the run-to-run variability of LLM-generated reports.

Key insights

LLM security review repeatability depends on the type of finding: reference-matched findings were stable across repetitions, while additional (unmatched) reports were highly inconsistent.

Principles

LLM security findings are unevenly repeatable.
Combine agentic LLM review with deterministic SAST.

Method

The study ran 300 repeated vulnerability-finding scans on JavaScript code, reusing the same prompt and benchmark harness each time, to measure how consistent LLM security reviews are.
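One way to turn repeated scans into a consistency measure is to count, for each unique finding, how many repetitions reported it. The sketch below assumes each finding can be reduced to a stable string identifier; the summary does not say how the benchmark deduplicates findings, so the identifier format, the `repetition_histogram` function, and the toy data are all hypothetical.

```python
from collections import Counter


def repetition_histogram(runs: list[set[str]]) -> dict[int, int]:
    """Map k -> number of unique findings reported in exactly k runs."""
    # Each run is a set, so a finding counts at most once per run.
    appearances = Counter(fid for run in runs for fid in run)
    return dict(Counter(appearances.values()))


# Toy data: five repetitions of the same scan (identifiers are made up).
runs = [
    {"sqli:app.js:42", "xss:view.js:10", "redos:util.js:7"},
    {"sqli:app.js:42", "xss:view.js:10"},
    {"sqli:app.js:42", "path:fs.js:3"},
    {"sqli:app.js:42", "xss:view.js:10"},
    {"sqli:app.js:42"},
]

hist = repetition_histogram(runs)
for k in sorted(hist, reverse=True):
    print(f"{hist[k]} finding(s) appeared in {k} of {len(runs)} runs")
```

In this toy data one finding appears in all five runs, one in three, and two in a single run. Splitting such a histogram into reference-matched and unmatched findings gives the kind of breakdown the study reports.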

In practice

LLMs can find high-signal exploit shapes.
SAST excels at enumerating data-flow sinks.
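A combined pipeline could treat SAST output as the deterministic baseline and label LLM-only findings by how often they recur. The following is a minimal sketch of one possible triage policy, not the approach used in the study: the `triage` function, the label names, and the finding identifiers are assumptions made for illustration.

```python
def triage(sast: set[str], llm_runs: list[set[str]]) -> dict[str, str]:
    """Label each finding by its source and, for LLM-only findings, its stability."""
    # SAST findings are deterministic, so they form the baseline.
    labels = {fid: "sast" for fid in sast}
    # Findings reported in every LLM repetition vs. in at least one.
    every_run = set.intersection(*llm_runs) if llm_runs else set()
    any_run = set().union(*llm_runs)
    for fid in any_run - sast:
        labels[fid] = "llm-stable" if fid in every_run else "llm-needs-review"
    return labels


# Hypothetical identifiers for illustration only.
sast_findings = {"sqli:app.js:42", "sqli:app.js:88"}
llm_runs = [
    {"sqli:app.js:42", "proto-pollution:merge.js:5", "xss:view.js:10"},
    {"sqli:app.js:42", "proto-pollution:merge.js:5"},
]

labels = triage(sast_findings, llm_runs)
for fid in sorted(labels):
    print(f"{labels[fid]:<17} {fid}")
```

Under this policy, a finding that only the LLM reports but that recurs in every repetition is kept as a candidate for something SAST missed, while one-off reports are routed to manual review instead of being trusted or discarded outright.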

Topics

Snyk VulnBench JS
LLM Security Review
Static Application Security Testing
Vulnerability Finding
JavaScript Security
Code Analysis Repeatability

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Security Engineer, AI Scientist, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.