Revisiting Vul-RAG: Reproducibility and Replicability of RAG-based Vulnerability Detection with Open-Weight Models
Summary
A reproducibility study of Vul-RAG, a Retrieval-Augmented Generation (RAG) framework for source code vulnerability detection, confirms that the original results are reproducible for the Qwen2.5-Coder-32B-Instruct model with minor deviations. For DeepSeek-Coder-V2-Instruct, it reports a performance drop, along with CUDA out-of-memory errors on 21 of 586 code pairs. The study also evaluates Vul-RAG across a diverse set of recent open-weight LLMs, including code-specialized, general-purpose, and reasoning models ranging from 3B to 32B parameters. Results consistently plateau at approximately 0.30 pairwise accuracy, regardless of newer model generations or larger parameter counts. Reasoning models achieved the highest pairwise accuracy, 0.29, at the cost of higher computational overhead. The findings suggest that model capabilities matter more than sheer size, and that the framework's effectiveness is capped by the inherent difficulty of the pairwise discrimination task. Implementation artifacts are publicly available.
Key takeaway
MLOps Engineers and AI Security Engineers deploying RAG-based vulnerability detection should prioritize model capabilities such as reasoning over raw parameter scale. Smaller open-weight models (e.g., 4B-8B) can achieve competitive performance with significantly lower computational overhead, which makes them more practical for on-premise deployment. To get past the observed performance plateau, focus on improving knowledge base quality and retrieval strategies rather than simply upgrading to larger LLMs.
Key insights
RAG-based vulnerability detection with open-weight LLMs faces a performance plateau around 0.30 pairwise accuracy.
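The summary does not define pairwise accuracy. A minimal sketch, assuming the common reading in which a pair counts as correct only when the vulnerable version is flagged and its patched counterpart is not; the function name and the toy predictions below are hypothetical and are not taken from the study's artifacts.

```python
from typing import List, Tuple


def pairwise_accuracy(predictions: List[Tuple[bool, bool]]) -> float:
    """Fraction of pairs where both versions are classified correctly.

    Each tuple is (vulnerable_version_flagged, patched_version_flagged).
    Assumed definition: a pair is correct only if the vulnerable version
    is flagged AND the patched version is not flagged.
    """
    if not predictions:
        raise ValueError("at least one code pair is required")
    correct = sum(
        1
        for vulnerable_flagged, patched_flagged in predictions
        if vulnerable_flagged and not patched_flagged
    )
    return correct / len(predictions)


# Toy predictions (illustrative only, not data from the study).
toy_pairs = [
    (True, False),   # both versions right
    (True, True),    # patched version wrongly flagged
    (False, False),  # vulnerable version missed
    (True, False),   # both versions right
]
toy_score = pairwise_accuracy(toy_pairs)
print(toy_score)  # 2 of 4 pairs are fully correct -> 0.5
```

Under this assumed definition, a detector that flags every input as vulnerable scores 0, because it never gets the patched half of a pair right. That is one way to see why the metric is demanding.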
Principles
- Reproducibility is critical for LLM-based research.
- Model capabilities can outweigh parameter scale.
- Abstraction can compensate for code specialization.
Method
Vul-RAG augments LLMs with vulnerability knowledge in three stages: offline construction of a knowledge base, online retrieval of the knowledge items relevant to the code under analysis, and iterative, knowledge-augmented detection that prompts the model about vulnerability causes and fixes.
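The three stages can be sketched as follows. This is an illustration under stated assumptions, not the Vul-RAG implementation: `KnowledgeItem`, `retrieve`, `detect`, and `stub_llm` are hypothetical names, the token-overlap ranking is a stand-in for whatever retriever the framework actually uses, and the decision rule (cause present and fix absent) is an assumed simplification of the iterative detection step.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class KnowledgeItem:
    """One entry of the offline knowledge base (hypothetical schema)."""
    cause: str
    fix: str


def retrieve(code: str, kb: List[KnowledgeItem], top_k: int = 3) -> List[KnowledgeItem]:
    """Rank items by naive token overlap with the code (stand-in retriever)."""
    code_tokens = set(code.lower().split())

    def overlap(item: KnowledgeItem) -> int:
        return len(code_tokens & set(item.cause.lower().split()))

    return sorted(kb, key=overlap, reverse=True)[:top_k]


def detect(code: str, kb: List[KnowledgeItem], ask: Callable[[str], bool], top_k: int = 3) -> bool:
    """Iterate over retrieved items; flag the code if any item's cause is
    present and its fix is absent (assumed decision rule)."""
    for item in retrieve(code, kb, top_k):
        has_cause = ask(f"CAUSE: {item.cause}\nCODE: {code}")
        has_fix = ask(f"FIX: {item.fix}\nCODE: {code}")
        if has_cause and not has_fix:
            return True
    return False


def stub_llm(prompt: str) -> bool:
    """Keyword stub standing in for a real LLM call (assumption)."""
    has_check = "if (p == NULL)" in prompt
    if prompt.startswith("FIX"):
        return has_check      # fix is applied when the check exists
    return not has_check      # cause is present when the check is missing


toy_kb = [
    KnowledgeItem(
        cause="pointer p is dereferenced without a prior check",
        fix="add a NULL check before the dereference",
    )
]
vulnerable_code = "int v = *p;"
patched_code = "if (p == NULL) return; int v = *p;"

print(detect(vulnerable_code, toy_kb, stub_llm))
print(detect(patched_code, toy_kb, stub_llm))
```

In a real deployment, `ask` would wrap a call to the chosen open-weight model, and the knowledge base would be built offline from historical vulnerability data rather than written by hand.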
In practice
- Evaluate smaller, resource-efficient LLMs.
- Prioritize reasoning capabilities over model size.
- Consider knowledge base quality for RAG systems.
Topics
- Retrieval-Augmented Generation
- Vulnerability Detection
- Large Language Models
- Reproducibility Study
- Open-Weight Models
- Software Security
Best for: Research Scientist, AI Scientist, AI Security Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.