Revisiting Vul-RAG: Reproducibility and Replicability of RAG-based Vulnerability Detection with Open-Weight Models

2026-03-06 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Software Development & Engineering · Depth: Expert, extended

Summary

A reproducibility study of Vul-RAG, a Retrieval-Augmented Generation (RAG) framework for source code vulnerability detection, confirms that the framework is reproducible with the Qwen2.5-Coder-32B-Instruct model, with minor deviations, but reports a performance drop and CUDA out-of-memory errors for DeepSeek-Coder-V2-Instruct on 21 of 586 code pairs. The study then evaluated Vul-RAG across a diverse set of recent open-weight LLMs, including code-specialized, general-purpose, and reasoning models ranging from 3B to 32B parameters. Results consistently show a performance plateau at approximately 0.30 pairwise accuracy, which neither newer model generations nor larger parameter counts overcome. Reasoning models achieved the highest pairwise accuracy, 0.29, at the cost of higher computational overhead. The findings suggest that model capabilities matter more than sheer size, and that the framework's effectiveness is capped by the inherent difficulty of the pairwise discrimination task. Implementation artifacts are publicly available.

Key takeaway

MLOps engineers and AI security engineers deploying RAG-based vulnerability detection should prioritize model capabilities such as reasoning over raw parameter scale. Smaller open-weight models (e.g., 4B-8B) can achieve competitive performance with significantly lower computational overhead, which makes them more practical for on-premise deployment. To overcome the observed performance plateau, focus on improving knowledge base quality and retrieval strategies rather than simply upgrading to larger LLMs.

Key insights

RAG-based vulnerability detection with open-weight LLMs faces a performance plateau around 0.30 pairwise accuracy.
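
The metric behind this plateau can be illustrated with a minimal sketch. It assumes that a code pair counts as correct only when the vulnerable version is flagged and its patched counterpart is not; the summary does not spell out the definition, so treat this as an assumption. The names PairPrediction and pairwise_accuracy are hypothetical and not taken from the study's artifacts.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PairPrediction:
    """Detector output for one (vulnerable, patched) code pair (hypothetical type)."""
    vulnerable_flagged: bool  # True if the vulnerable version was flagged as vulnerable
    patched_flagged: bool     # True if the patched version was flagged as vulnerable


def pairwise_accuracy(pairs: Sequence[PairPrediction]) -> float:
    """Fraction of pairs where BOTH versions are classified correctly.

    Assumed definition: the vulnerable version must be flagged and the
    patched version must not be flagged.
    """
    if not pairs:
        raise ValueError("at least one pair is required")
    correct = sum(
        1 for p in pairs if p.vulnerable_flagged and not p.patched_flagged
    )
    return correct / len(pairs)


if __name__ == "__main__":
    predictions = [
        PairPrediction(True, False),   # both versions correct
        PairPrediction(True, True),    # patched version wrongly flagged
        PairPrediction(False, False),  # vulnerability missed
        PairPrediction(False, True),   # both versions wrong
    ]
    print(pairwise_accuracy(predictions))
```

Under this definition, a detector that labels everything as vulnerable (or everything as safe) scores zero, which is one reason the pairwise task is harder than per-function classification.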

Principles

Reproducibility is critical for LLM-based research.
Model capabilities can outweigh parameter scale.
Abstraction can compensate for code specialization.

Method

Vul-RAG enhances LLMs with vulnerability knowledge via offline knowledge base construction, online retrieval of relevant items, and iterative, knowledge-augmented detection using prompts for causes and fixes.
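
The three stages described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the framework's actual implementation: the KnowledgeItem fields, the token-overlap retrieval (a toy stand-in for whatever retriever the framework uses), the prompt wording, and the decision rule (vulnerable when a known cause is present and its fix is absent) are all hypothetical. The LLM is abstracted as a callable `ask` that answers a yes/no prompt.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass(frozen=True)
class KnowledgeItem:
    """One entry of the offline knowledge base (hypothetical schema)."""
    description: str  # text used for retrieval
    cause: str        # what makes the code vulnerable
    fix: str          # what the patch changes


def tokenize(text: str) -> Set[str]:
    return set(text.lower().split())


def retrieve(code: str, kb: List[KnowledgeItem], top_k: int = 2) -> List[KnowledgeItem]:
    """Online retrieval: rank items by Jaccard token overlap (toy stand-in)."""
    query = tokenize(code)

    def score(item: KnowledgeItem) -> float:
        doc = tokenize(item.description)
        union = query | doc
        return len(query & doc) / len(union) if union else 0.0

    # sorted() is stable, so ties keep their knowledge base order.
    return sorted(kb, key=score, reverse=True)[:top_k]


def detect(code: str, kb: List[KnowledgeItem],
           ask: Callable[[str], bool], top_k: int = 2) -> bool:
    """Iterate over retrieved items, prompting first for the cause, then the fix.

    Assumed decision rule: report vulnerable as soon as one item's cause is
    present and its fix is absent; otherwise report not vulnerable.
    """
    for item in retrieve(code, kb, top_k):
        if not ask(f"CAUSE: does the code exhibit '{item.cause}'?\n{code}"):
            continue  # this knowledge item does not apply
        if not ask(f"FIX: does the code already apply '{item.fix}'?\n{code}"):
            return True  # cause present, fix missing
    return False
```

In use, `ask` would wrap a call to the chosen open-weight model and parse its answer into a boolean. Keeping it as a plain callable makes the loop easy to test with a stub and easy to rerun across models, which is the kind of comparison the study performs.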

In practice

Evaluate smaller, resource-efficient LLMs.
Prioritize reasoning capabilities over model size.
Consider knowledge base quality for RAG systems.

Topics

Retrieval-Augmented Generation
Vulnerability Detection
Large Language Models
Reproducibility Study
Open-Weight Models
Software Security

Code references

hs-esslingen-it-security/revisiting-Vul-RAG

Best for: Research Scientist, AI Scientist, AI Security Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.