Red-team our own AI agents before shipping them?
Bessemer calls it 'the defining cybersecurity challenge of 2026.' McKinsey's Lilli compromise and Microsoft's prompt-injection RCE disclosures make the case concrete.
The question
Should we fund a dedicated red team to attack our internal AI agents (prompt injection, tool misuse, data exfil) before they go production, or rely on the framework vendors' built-in guardrails?
The premise
- Team
- ~50 engineers, ~10 actively building AI features, single MLOps engineer. AI work pulls from feature-shipping capacity — any new commitment has to trade against the roadmap. No dedicated red team. One engineer with appsec background; CISO is fractional.
- Compliance
- SOC2 Type II in scope. EU customer data subjects us to GDPR plus the EU AI Act's August 2026 GPAI-deployer obligations. Adversarial testing is referenced in AI Act risk-management docs and increasingly in enterprise RFPs.
- Stack
- 11 agents in production (3 customer-facing RAG, 8 internal automation). Frameworks: LangGraph + custom retrieval. LLM providers: OpenAI + Anthropic. Tool access: ~6 agents have database read access, 2 have write access, 3 can send external email or make external API calls. No formal adversarial testing today; the annual external pen test covers infra but not the agent layer.
- Budget
- Monthly AI spend ~$30K with quarterly board visibility. Approvals required for sustained jumps >20%. Cost-per-outcome metrics in place; finance asks for unit economics by use case. External red-team engagements quoted at $30K-$60K per agent, so covering all 11 would cost $330K-$660K, roughly one to two years of current AI spend (~$360K/yr). Not viable.
What's our actual risk tolerance for an agent prompt-injection incident?
Low. Two of our customer-facing agents have tool-call access to billing data. A successful injection that exfiltrates billing info would be a SOC2 reportable incident, a GDPR breach, and likely a customer-comms problem. We've had zero such incidents on the existing surface, but the agent surface is new and growing.
Why not just rely on the framework vendors' built-in guardrails?
Framework guardrails catch the textbook attacks but assume the attacker doesn't know your specific tool topology. The real risk is the second-order attack: compromise the upstream RAG corpus, then inject through retrieved content. No vendor guardrail catches that. We need adversarial testing against OUR specific tool graph, not someone else's.
What does a minimum-viable in-house red-team program look like?
Quarterly internal red-team week using an existing harness (PyRIT or garak), one full-time engineer-week per quarter, focused on the 3-5 highest-blast-radius agents. Output: a written findings doc plus remediations tracked in our normal engineering backlog. That's enough for SOC2 and AI Act evidence, and it scales up if agent count grows past 25.
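One way to pick the 3-5 highest-blast-radius agents is to score each agent on what it can touch. The sketch below is illustrative only: the agent names, the attribute set, and the weights are all assumptions, not values from our inventory, and should be replaced with whatever the team agrees reflects real impact.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    """Capabilities that drive blast radius (hypothetical attribute set)."""
    name: str
    customer_facing: bool = False
    db_read: bool = False
    db_write: bool = False
    external_send: bool = False
    touches_billing: bool = False


# Assumed weights: higher means a successful injection does more damage.
WEIGHTS = {
    "customer_facing": 2,
    "db_read": 1,
    "db_write": 3,
    "external_send": 3,
    "touches_billing": 4,
}


def blast_radius(agent: Agent) -> int:
    """Sum the weights of every capability the agent has."""
    return sum(w for attr, w in WEIGHTS.items() if getattr(agent, attr))


def pick_targets(agents: list[Agent], limit: int = 5) -> list[str]:
    """Return the names of the highest-scoring agents, ties broken by name."""
    ranked = sorted(agents, key=lambda a: (-blast_radius(a), a.name))
    return [a.name for a in ranked[:limit]]
```

Calling `pick_targets(inventory, limit=5)` each quarter gives the red-team week a repeatable target list, and the scores themselves can go into the findings doc as evidence of how scope was chosen.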
Counsel's position
Defer funding a dedicated internal red team until OpenAI's native Promptfoo integration ships, and instead use your single AppSec engineer to enforce strict source-sink constraints and input sanitization on your 11 production agents.
Verdict
The verdict: Enforce integrity controls on agent configuration files to block supply-chain injections.
Enforce integrity controls on agent configuration files to block supply-chain injections
With 8 internal automation agents running on LangGraph, compromised dependencies can silently rewrite your agent instructions before execution.
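A minimal sketch of such an integrity control, assuming agent prompts and tool definitions live as files in one directory: record a SHA-256 digest per file at review time, then refuse to start an agent if anything has changed. The function names and directory layout are hypothetical. The manifest must be stored outside the config directory, somewhere the agent's dependencies cannot write (ideally signed).

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_manifest(config_dir: Path) -> dict[str, str]:
    """Map each file under config_dir (by relative path) to its digest."""
    return {
        str(p.relative_to(config_dir)): sha256_of(p)
        for p in sorted(config_dir.rglob("*"))
        if p.is_file()
    }


def verify_manifest(config_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Return a sorted list of problems; an empty list means nothing changed."""
    problems = []
    current = build_manifest(config_dir)
    for name, expected in manifest.items():
        actual = current.get(name)
        if actual is None:
            problems.append(f"missing: {name}")
        elif actual != expected:
            problems.append(f"modified: {name}")
    # Files that appeared after review are as suspicious as edited ones.
    for name in current.keys() - manifest.keys():
        problems.append(f"unexpected: {name}")
    return sorted(problems)
```

Run `verify_manifest` at agent startup and in CI, and treat any non-empty result as a failed deploy rather than a warning.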
Standardize on human-curated agent tools rather than model-generated skills
As you scale your LangGraph tool calls across 11 agents, research shows that relying on LLMs to author their own procedural logic degrades performance and introduces vulnerabilities.
Build source-sink constraints into your LangGraph agents instead of relying on input filters
Given your fractional CISO and lack of a dedicated red team, assume prompt injections will succeed and design your agent architecture to limit the blast radius of compromised tool calls.
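A source-sink constraint can be sketched as taint tracking on each agent run: once a run has read untrusted content, it may no longer reach a sensitive sink, whatever the model asks for. The sketch below is framework-agnostic and does not use any LangGraph API. The tool names and the `AgentRun` wrapper are hypothetical, and a real deployment would derive the source and sink sets from its own tool graph.

```python
from dataclasses import dataclass, field

# Hypothetical tool labels, for illustration only.
UNTRUSTED_SOURCES = {"read_inbound_email", "retrieve_rag_chunk", "fetch_url"}
SENSITIVE_SINKS = {"send_external_email", "call_external_api", "write_database"}


class SinkBlocked(Exception):
    """Raised when a tainted run tries to reach a sensitive sink."""


@dataclass
class AgentRun:
    """Tracks which untrusted sources this run has already consumed."""
    tainted_by: list[str] = field(default_factory=list)

    def call_tool(self, name: str, handler, *args, **kwargs):
        # Enforce the constraint before the tool executes.
        if name in SENSITIVE_SINKS and self.tainted_by:
            raise SinkBlocked(
                f"{name} blocked: run already read untrusted input "
                f"from {self.tainted_by}"
            )
        result = handler(*args, **kwargs)
        # Mark the run as tainted only after the source was actually read.
        if name in UNTRUSTED_SOURCES:
            self.tainted_by.append(name)
        return result
```

The check does not inspect prompt text at all, so it still holds when an injection gets past every input filter. In practice a blocked call could be routed to human approval instead of failing outright.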
Strip invisible characters and HTML comments from external emails before agent processing
With 3 of your agents able to send external email or make external API calls, a zero-click prompt injection hidden in inbound content can silently exfiltrate sensitive data without any user interaction.
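A minimal sketch of that sanitization step, using only the Python standard library. The function name is hypothetical. It removes well-formed HTML comments and drops characters in the Unicode format, control, private-use, and unassigned categories, which covers zero-width characters, bidi overrides, and tag characters. It does not catch text hidden with CSS or an unterminated comment, so treat it as one layer, not a complete defense.

```python
import re
import unicodedata

# Non-greedy so that two separate comments are not merged into one match.
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

# Cf = format (zero-width, bidi, tag chars), Cc = control,
# Co = private use, Cn = unassigned.
HIDDEN_CATEGORIES = {"Cf", "Cc", "Co", "Cn"}
KEEP = {"\n", "\t", "\r"}


def strip_hidden_content(text: str) -> str:
    """Remove HTML comments and invisible characters from untrusted text."""
    without_comments = HTML_COMMENT.sub("", text)
    return "".join(
        ch
        for ch in without_comments
        if ch in KEEP or unicodedata.category(ch) not in HIDDEN_CATEGORIES
    )
```

One trade-off to note: dropping every format character also removes zero-width joiners, which breaks some emoji sequences and affects scripts that rely on them. For agent input that is usually acceptable, but the original message should be kept for human review.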