How Amazon uses agentic AI for vulnerability detection at global scale
Summary
Amazon's RuleForge system uses agentic AI to speed up the creation of production-ready vulnerability detection rules, achieving a 336% increase in speed compared to traditional manual methods. It addresses the growing volume of new common vulnerabilities and exposures (CVEs), more than 48,000 of which were published in 2025, by automating the translation of vulnerability disclosures into detection logic. RuleForge uses a multi-agent architecture that mirrors the workflow of human experts, with specialized AI agents for ingestion, parallel rule generation, AI-powered evaluation, and multistage validation. A critical component is a separate "judge" model that, using domain-specific prompts and negative phrasing, reduces false positives by 67% while preserving true positives, which gives the precision that production security systems require. A human-in-the-loop design keeps final oversight with human reviewers while narrowing the gap between vulnerability disclosure and defense.
Key takeaway
For AI architects and security teams tasked with scaling vulnerability defense, RuleForge demonstrates that agentic AI can augment human expertise at production scale. Consider adopting a multi-agent architecture with distinct generation and evaluation models to accelerate rule creation and reduce false positives. This approach lets your team shift from manual authoring to critical review, increasing throughput and improving protection against high-severity CVEs.
Key insights
Agentic AI systems can dramatically accelerate vulnerability detection rule generation while maintaining high precision through specialized agents.
Principles
- Decompose complex tasks into specialized AI agent stages.
- Separate generation from evaluation for improved accuracy.
- Incorporate human-in-the-loop for final validation.
Method
RuleForge ingests exploit code, generates multiple candidate rules in parallel using AWS Fargate and Amazon Bedrock, scores them with a dedicated judge model, and validates them against synthetic tests and traffic logs before human review.
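The stages described above can be sketched in a few lines of Python. This is a minimal illustration, not RuleForge's implementation: a local thread pool stands in for the parallel generation that the article attributes to AWS Fargate and Amazon Bedrock, and every name here (`run_pipeline`, `Candidate`, the toy generators, judge, and validator) is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Optional

# Benign requests used by the toy validator (illustrative data only).
BENIGN_REQUESTS = ["/index.html", "/login"]


@dataclass
class Candidate:
    rule: str     # here a rule is just a substring pattern
    score: float  # judge score in [0, 1]; higher is better


def run_pipeline(
    exploit: str,
    generators: List[Callable[[str], str]],
    judge: Callable[[str, str], float],
    validators: List[Callable[[str], bool]],
    min_score: float = 0.5,
) -> Optional[Candidate]:
    """Return the best candidate to hand to a human reviewer, or None."""
    if not generators:
        return None
    # Generate candidate rules in parallel, one task per generator.
    with ThreadPoolExecutor(max_workers=len(generators)) as pool:
        rules = list(pool.map(lambda gen: gen(exploit), generators))
    # Score every candidate with a judge that is separate from the generators.
    scored = sorted(
        (Candidate(rule, judge(exploit, rule)) for rule in rules),
        key=lambda c: c.score,
        reverse=True,
    )
    # Validate in score order; the first survivor goes to human review.
    for cand in scored:
        if cand.score < min_score:
            break
        if all(check(cand.rule) for check in validators):
            return cand
    return None


# Toy stand-ins for the model-backed stages (assumptions, not real agents).
def narrow_generator(exploit: str) -> str:
    return exploit  # match the exploit path exactly


def broad_generator(exploit: str) -> str:
    return "/"  # overly broad pattern that matches almost everything


def toy_judge(exploit: str, rule: str) -> float:
    return min(1.0, len(rule) / 10)  # crude proxy: penalize very short patterns


def no_benign_match(rule: str) -> bool:
    return not any(rule in request for request in BENIGN_REQUESTS)
```

Calling `run_pipeline("/admin/debug?cmd=", [broad_generator, narrow_generator], toy_judge, [no_benign_match])` returns the narrow candidate, because the broad pattern scores below the threshold and would also match benign traffic. Keeping the judge and validators as separate callables reflects the principle of separating generation from evaluation.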
In practice
- Use negative phrasing in prompts for better LLM calibration.
- Employ domain-specific prompts for evaluation agents.
- Implement multi-agent systems for complex security tasks.
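As a rough illustration of the first two points, the sketch below builds a judge prompt and parses the reply. It assumes one possible reading of "negative phrasing": asking the judge how likely the rule is to fail rather than how confident it is that the rule works. The prompt wording, the `MISS=`/`FALSE_ALARM=` reply format, and the function names are all hypothetical and are not taken from RuleForge.

```python
import re
from typing import Optional, Tuple


def build_judge_prompt(domain: str, cve_summary: str, rule: str) -> str:
    """Build a domain-specific, negatively phrased evaluation prompt."""
    return (
        f"You are reviewing a detection rule for {domain}.\n"
        f"Vulnerability summary:\n{cve_summary}\n\n"
        f"Candidate rule:\n{rule}\n\n"
        # Negative phrasing: ask about failure, not success.
        "Q1: What is the probability (0-100) that this rule FAILS to flag "
        "a request that exploits the vulnerability?\n"
        "Q2: What is the probability (0-100) that this rule flags a benign "
        "request?\n"
        "Reply exactly as: MISS=<number> FALSE_ALARM=<number>"
    )


def parse_judge_reply(reply: str) -> Optional[Tuple[float, float]]:
    """Return (miss_rate, false_alarm_rate) in [0, 1], or None if malformed."""
    match = re.search(r"MISS=(\d{1,3})\s+FALSE_ALARM=(\d{1,3})", reply)
    if match is None:
        return None
    miss, false_alarm = int(match.group(1)), int(match.group(2))
    if miss > 100 or false_alarm > 100:
        return None  # reject out-of-range estimates
    return miss / 100, false_alarm / 100
```

The prompt string would be sent to whichever evaluation model is in use. Rejecting malformed or out-of-range replies keeps a misbehaving judge from silently approving a rule.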
Topics
- RuleForge
- Agentic AI
- Vulnerability Detection
- CVEs
- Security Automation
Best for: AI Architect, AI Product Manager, CTO, AI Security Engineer, AI Engineer, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Amazon Science.