Sifting the Noise: A Comparative Study of LLM Agents in Vulnerability False Positive Filtering

2025-08-07 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Software Development & Engineering · Depth: Expert, extended

Summary

A comparative study evaluated three LLM-based agent frameworks—Aider, OpenHands, and SWE-agent—for filtering false positives (FPs) from Static Application Security Testing (SAST) tools. Using the OWASP Benchmark (v1.2) and real-world Java projects from the Vul4J dataset, the research found that LLM agents significantly reduce SAST noise. The best configuration, SWE-agent with Claude Sonnet 4, lowered the initial FP detection rate from over 92% to 6.3% on the OWASP Benchmark, a 92.1% reduction. On real-world CodeQL alerts, agents achieved up to a 93.3% FP identification rate. However, benefits are highly dependent on the backbone model (Claude Sonnet 4 and GPT-5 showed strong gains, DeepSeek Chat less so) and vulnerability category, with data-flow-driven issues filtered more effectively than policy or cryptography-related CWEs. The study also highlighted trade-offs, noting that aggressive FP reduction risks suppressing true vulnerabilities, with a 22.25% miss rate for true positives.

Key takeaway

For AI Security Engineers evaluating LLM agents for SAST false positive filtering, prioritize robust backbone models like Claude Sonnet 4 or GPT-5; agentic frameworks significantly enhance their performance. Be cautious with automatic suppression, especially for policy- or cryptography-related vulnerabilities, where agents show higher true positive miss rates. Deploy agents as decision-support tools, focusing on data-flow-driven issues, and integrate human-in-the-loop auditing for critical categories to balance efficiency with security.
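The decision-support posture described in this takeaway can be sketched as a small routing rule. This is an illustrative sketch only: the `Action` names, the `AgentVerdict` type, and the grouping of CWE identifiers into the two sets are assumptions made for the example, not the study's taxonomy or tooling. The point is that an agent's "false positive" verdict lowers an alert's priority only for data-flow-driven categories, and goes to a human reviewer for everything else.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    KEEP = "keep"                  # agent believes the alert is a real vulnerability
    DEPRIORITIZE = "deprioritize"  # agent says FP in a category it handles well
    HUMAN_REVIEW = "human_review"  # agent says FP, but the category needs an audit


# Hypothetical grouping for illustration; adjust to your own risk policy.
DATA_FLOW_CWES = {"CWE-22", "CWE-78", "CWE-79", "CWE-89"}
POLICY_OR_CRYPTO_CWES = {"CWE-327", "CWE-328", "CWE-330", "CWE-614"}


@dataclass(frozen=True)
class AgentVerdict:
    cwe: str
    is_false_positive: bool


def route(verdict: AgentVerdict) -> Action:
    """Decide what to do with one SAST alert given the agent's verdict."""
    if not verdict.is_false_positive:
        return Action.KEEP
    if verdict.cwe in DATA_FLOW_CWES:
        return Action.DEPRIORITIZE
    # Policy, cryptography, and unknown categories are never auto-suppressed.
    return Action.HUMAN_REVIEW
```

Unknown categories fall through to human review on purpose, so that a gap in the category lists fails toward caution rather than toward suppression.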

Key insights

LLM agents significantly reduce SAST false positives, but effectiveness is highly dependent on the backbone model and vulnerability category.

Principles

Agentic reasoning amplifies strong LLM backbone model capabilities.
Data-flow vulnerabilities are more reliably filtered than policy-based CWEs.
Aggressive false positive reduction risks suppressing true vulnerabilities.

Method

Three LLM agent frameworks (Aider, OpenHands, SWE-agent) were compared, each paired with Claude Sonnet 4, DeepSeek Chat, and GPT-5 backbones. Given access to the codebase, each agent classified SAST warnings as true vulnerabilities or false positives.
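To make the two headline quantities concrete, here is a minimal sketch of how an FP identification rate and a true-positive miss rate could be computed from per-alert labels. The function name, the input format, and the exact metric definitions are assumptions for illustration; the paper's own definitions may differ.

```python
from typing import Dict, Sequence


def filtering_metrics(
    is_true_vuln: Sequence[bool],
    agent_flags_fp: Sequence[bool],
) -> Dict[str, float]:
    """Score an agent's FP filtering against ground-truth labels.

    is_true_vuln[i]   -- ground truth: alert i is a real vulnerability
    agent_flags_fp[i] -- agent verdict: alert i is a false positive
    """
    if len(is_true_vuln) != len(agent_flags_fp):
        raise ValueError("label lists must have the same length")

    pairs = list(zip(is_true_vuln, agent_flags_fp))
    fp_total = sum(1 for truth, _ in pairs if not truth)
    tp_total = sum(1 for truth, _ in pairs if truth)
    fp_caught = sum(1 for truth, flagged in pairs if not truth and flagged)
    tp_missed = sum(1 for truth, flagged in pairs if truth and flagged)

    return {
        # Share of genuine false positives that the agent filtered out.
        "fp_identification_rate": fp_caught / fp_total if fp_total else 0.0,
        # Share of real vulnerabilities that the agent wrongly dismissed.
        "tp_miss_rate": tp_missed / tp_total if tp_total else 0.0,
    }


# Made-up labels: four spurious alerts, two real vulnerabilities.
truth = [False, False, False, False, True, True]
flags = [True, True, True, False, True, False]
metrics = filtering_metrics(truth, flags)
```

Reporting both numbers side by side matters because an agent that flags every alert as a false positive would score a perfect identification rate while missing every real vulnerability.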

In practice

Resolve complex FPs using cross-file semantic resolution.
Disambiguate control-flow with constant folding via calculator calls.
Ground crypto/factory verdicts by validating configuration files.
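The constant-folding item above can be illustrated with a small "calculator" helper of the kind an agent might call as a tool. This is a hedged sketch, not the tooling used in the study: the function name `fold_constant`, its interface, and the sample expressions are assumptions. It evaluates an integer arithmetic comparison with known variable values, which is enough to tell whether a guard in front of a tainted value always takes the safe branch.

```python
import ast
import operator
from typing import Mapping, Optional, Union

_BIN_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}
_CMP_OPS = {
    ast.Gt: operator.gt, ast.GtE: operator.ge,
    ast.Lt: operator.lt, ast.LtE: operator.le,
    ast.Eq: operator.eq, ast.NotEq: operator.ne,
}


def fold_constant(
    expr: str, env: Optional[Mapping[str, int]] = None
) -> Union[int, bool]:
    """Evaluate integer arithmetic and one-level comparisons without eval()."""
    known = dict(env or {})

    def ev(node: ast.AST) -> Union[int, bool]:
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            return node.value
        if isinstance(node, ast.Name):
            if node.id in known:
                return known[node.id]
            # Value not statically known, so the branch cannot be decided.
            raise ValueError(f"unknown variable: {node.id}")
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](ev(node.left), ev(node.right))
        if (isinstance(node, ast.Compare) and len(node.ops) == 1
                and type(node.ops[0]) in _CMP_OPS):
            return _CMP_OPS[type(node.ops[0])](ev(node.left), ev(node.comparators[0]))
        raise ValueError(f"unsupported expression: {ast.dump(node)}")

    return ev(ast.parse(expr, mode="eval"))


# A guard of the shape `cond ? "constant" : param`, with num assigned a literal.
always_safe = fold_constant("(7 * 18) + num > 200", {"num": 106})
```

If the guard folds to `True`, the tainted parameter never reaches the sink on that path, which supports a false-positive verdict. If a variable is not a known constant, the helper raises instead of guessing, and the alert should stay open. Division is left out deliberately because Python and Java round negative integer division differently.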

Topics

LLM Agents
SAST False Positives
Vulnerability Triage
CodeQL
OWASP Benchmark
Software Security

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Scientist, AI Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.