Sifting the Noise: A Comparative Study of LLM Agents in Vulnerability False Positive Filtering
Summary
A comparative study evaluated three LLM-based agent frameworks (Aider, OpenHands, and SWE-agent) for filtering false positives (FPs) from Static Application Security Testing (SAST) tools. Using the OWASP Benchmark (v1.2) and real-world Java projects from the Vul4J dataset, the study found that LLM agents significantly reduce SAST noise. The best configuration, SWE-agent with Claude Sonnet 4, lowered the false positive rate on the OWASP Benchmark from over 92% to 6.3%, a 92.1% reduction. On real-world CodeQL alerts, agents achieved up to a 93.3% FP identification rate. The benefits depend heavily on the backbone model (Claude Sonnet 4 and GPT-5 showed strong gains, DeepSeek Chat less so) and on the vulnerability category: data-flow-driven issues were filtered more effectively than policy- or cryptography-related CWEs. The study also highlighted a trade-off: aggressive FP reduction risks suppressing true vulnerabilities, with a reported 22.25% miss rate for true positives.
Key takeaway
For AI Security Engineers evaluating LLM agents for SAST false positive filtering, prioritize robust backbone models like Claude Sonnet 4 or GPT-5; agentic frameworks significantly enhance their performance. Be cautious with automatic suppression, especially for policy- or cryptography-related vulnerabilities, where agents show higher true positive miss rates. Deploy agents as decision-support tools, focusing on data-flow-driven issues, and integrate human-in-the-loop auditing for critical categories to balance efficiency with security.
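The decision-support recommendation above can be sketched as a small routing policy. This is a minimal illustration, not the study's pipeline: the `Alert` type, the verdict labels, the queue names, and the grouping of CWEs into data-flow versus policy/cryptography sets are all assumptions made for the example.

```python
from dataclasses import dataclass

# Assumption: this grouping of OWASP Benchmark CWE categories is illustrative
# and is not a list taken from the study.
DATA_FLOW_CWES = {"CWE-22", "CWE-78", "CWE-79", "CWE-89", "CWE-90", "CWE-643"}
POLICY_OR_CRYPTO_CWES = {"CWE-327", "CWE-328", "CWE-330", "CWE-501", "CWE-614"}


@dataclass(frozen=True)
class Alert:
    alert_id: str
    cwe: str
    agent_verdict: str  # hypothetical labels: "false_positive" or "true_positive"


def route(alert: Alert) -> str:
    """Pick a review queue; the agent verdict only sets priority, it never deletes an alert."""
    if alert.agent_verdict == "true_positive":
        return "fix_queue"
    if alert.cwe in DATA_FLOW_CWES:
        # Agent says FP in a category where filtering was reported as more reliable.
        return "low_priority_review"
    # Policy/crypto CWEs and anything unrecognized go to a human auditor.
    return "human_audit"


alerts = [
    Alert("a1", "CWE-89", "false_positive"),
    Alert("a2", "CWE-327", "false_positive"),
    Alert("a3", "CWE-79", "true_positive"),
]
queues = {a.alert_id: route(a) for a in alerts}
```

Routing unrecognized CWEs to `human_audit` by default keeps the policy conservative: a category is only deprioritized once someone has explicitly added it to the data-flow set.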
Key insights
LLM agents significantly reduce SAST false positives, but effectiveness is highly dependent on the backbone model and vulnerability category.
Principles
- Agentic reasoning amplifies strong LLM backbone model capabilities.
- Data-flow vulnerabilities are more reliably filtered than policy-based CWEs.
- Aggressive false positive reduction risks suppressing true vulnerabilities.
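The trade-off in the last principle is easier to see when both rates are computed side by side. The sketch below uses plain definitions that are assumptions here (the study may define its metrics differently), and the ten labeled alerts are made-up data for illustration only.

```python
def filtering_metrics(is_real_vuln, agent_says_fp):
    """Compute two rates from parallel lists of booleans.

    is_real_vuln[i]  -- ground truth: True if alert i is a true vulnerability
    agent_says_fp[i] -- agent verdict: True if the agent labeled alert i a false positive
    """
    pairs = list(zip(is_real_vuln, agent_says_fp))
    total_fp = sum(1 for real, _ in pairs if not real)
    total_tp = sum(1 for real, _ in pairs if real)
    caught_fp = sum(1 for real, says_fp in pairs if not real and says_fp)
    missed_tp = sum(1 for real, says_fp in pairs if real and says_fp)
    return {
        # Share of actual false positives that the agent correctly flagged.
        "fp_identification_rate": caught_fp / total_fp if total_fp else 0.0,
        # Share of true vulnerabilities that the agent wrongly dismissed.
        "tp_miss_rate": missed_tp / total_tp if total_tp else 0.0,
    }


# Hypothetical data: 6 false alarms and 4 real vulnerabilities.
truth = [False] * 6 + [True] * 4
verdicts = [True, True, True, True, True, False,   # 5 of 6 false alarms flagged
            True, False, False, False]             # 1 of 4 real issues dismissed
metrics = filtering_metrics(truth, verdicts)
```

Reporting the miss rate next to the identification rate makes it harder to tune a filter toward one number while silently degrading the other.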
Method
The study compared three LLM agent frameworks (Aider, OpenHands, SWE-agent), each paired with Claude Sonnet 4, DeepSeek Chat, and GPT-5 backbones. Each agent had access to the codebase and classified every SAST warning as either a true vulnerability or a false positive.
In practice
- Resolve complex FPs using cross-file semantic resolution.
- Disambiguate control flow by folding constant expressions via calculator tool calls.
- Ground crypto/factory verdicts by validating configuration files.
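As an illustration of the constant-folding item, the sketch below stands in for a calculator-style tool: it evaluates a branch condition only when every name in it is a known constant, which tells a triager whether a tainted branch is reachable. The `fold` helper and the sample expressions are hypothetical, and the code relies on the fact that simple integer arithmetic and comparisons are written the same way in Java and Python, so Python's `ast` module can parse them.

```python
import ast
import operator

_BIN_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}
_CMP_OPS = {
    ast.Gt: operator.gt, ast.GtE: operator.ge,
    ast.Lt: operator.lt, ast.LtE: operator.le,
    ast.Eq: operator.eq, ast.NotEq: operator.ne,
}


def fold(expr, known):
    """Return the value of expr if it is fully constant, otherwise None."""

    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            return node.value
        if isinstance(node, ast.Name):
            if node.id not in known:
                raise KeyError(node.id)  # depends on a non-constant input
            return known[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](ev(node.left), ev(node.right))
        if (isinstance(node, ast.Compare) and len(node.ops) == 1
                and type(node.ops[0]) in _CMP_OPS):
            return _CMP_OPS[type(node.ops[0])](ev(node.left), ev(node.comparators[0]))
        raise ValueError("unsupported expression")

    try:
        return ev(ast.parse(expr, mode="eval").body)
    except (KeyError, ValueError, SyntaxError):
        return None


# Hypothetical guards of the form: bar = <condition> ? "safe constant" : param;
always_safe = fold("(7 * 18) + num > 200", {"num": 106})   # 126 + 106 = 232 > 200
takes_param = fold("(7 * 42) - num > 200", {"num": 106})   # 294 - 106 = 188 > 200
unknown = fold("(7 * 18) + userInput > 200", {"num": 106})  # userInput is not constant
```

Here `always_safe` is `True`, so the tainted `param` branch is dead code and the alert is a likely false positive; `takes_param` is `False`, so the tainted value does flow onward; `unknown` is `None`, meaning folding cannot settle the question and the alert needs further analysis.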
Topics
- LLM Agents
- SAST False Positives
- Vulnerability Triage
- CodeQL
- OWASP Benchmark
- Software Security
Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Scientist, AI Engineer, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.