Red-team our own AI agents before shipping them?

Bessemer calls it 'the defining cybersecurity challenge of 2026.' McKinsey's Lilli compromise + Microsoft's prompt-injection RCE disclosures make the case concrete.

· Counsel verdict · AIssential

The question

Should we fund a dedicated red team to attack our internal AI agents (prompt injection, tool misuse, data exfil) before they go production, or rely on the framework vendors' built-in guardrails?

The premise

Team
~50 engineers, ~10 actively building AI features, single MLOps engineer. AI work pulls from feature-shipping capacity — any new commitment has to trade against the roadmap. No dedicated red team. One engineer with appsec background; CISO is fractional.
Compliance
SOC2 Type II in scope. EU customer data subjects us to GDPR plus the EU AI Act's August 2026 GPAI-deployer obligations. Adversarial testing is referenced in AI Act risk-management docs and increasingly in enterprise RFPs.
Stack
11 agents in production (3 customer-facing RAG, 8 internal automation). Frameworks: LangGraph + custom retrieval. LLM providers: OpenAI + Anthropic. Tool calls: ~6 agents have database read access, 2 have write access, 3 can send external email or API calls. No formal adversarial testing today; annual external pen-test covers infra but not the agent layer.
Budget
Monthly AI spend ~$30K with quarterly board visibility. Approvals required for sustained jumps >20%. Cost-per-outcome metrics in place; finance asks for unit economics by use case. External red-team engagements quoted at $30K-$60K per agent — not viable for 11.

What's our actual risk tolerance for an agent prompt-injection incident?

Low. Two of our customer-facing agents have tool-call access to billing data. A successful injection that exfiltrates billing info would be a SOC2 reportable incident, GDPR breach, and a likely customer-comms problem. We've had zero such incidents in the existing surface, but the agent surface is new and growing.

Why not just rely on the framework vendors' built-in guardrails?

Framework guardrails catch the textbook attacks but assume the attacker doesn't know your specific tool topology. The real risk is the second-order attack: compromise the upstream RAG corpus, then inject through retrieved content. No vendor guardrail catches that. We need adversarial testing against OUR specific tool graph, not someone else's.

What does a minimum-viable in-house red-team program look like?

Quarterly internal red-team week using an existing harness (Pyrit or Garak), one full-time engineer-week per quarter, focused on the 3-5 highest-blast-radius agents. Output: a written findings doc + remediations tracked in our normal engineering backlog. That's enough for SOC2 + AI Act evidence; scales up if agent count grows past 25.

Counsel's position

Defer funding a dedicated internal red team until OpenAI's native PromptFu integration ships, and instead utilize your single AppSec engineer to enforce strict source-sink constraints and input sanitization on your 11 production agents.

Verdict

The verdict: Enforce integrity controls on agent configuration files to block supply-chain injections.

Enforce integrity controls on agent configuration files to block supply-chain injections

With 8 internal automation agents running on LangGraph, compromised dependencies can silently rewrite your agent instructions before execution.

Standardize on human-curated agent tools rather than model-generated skills

As you scale your LangGraph tool calls across 11 agents, research shows that relying on LLMs to author their own procedural logic degrades performance and introduces vulnerabilities.

Build source-sink constraints into your LangGraph agents instead of relying on input filters

Given your fractional CISO and lack of a dedicated red team, assume prompt injections will succeed and design your agent architecture to limit the blast radius of compromised tool calls.

Strip invisible characters and HTML comments from external emails before agent processing

With 3 of your agents authorized to read external emails or APIs, zero-click prompt injections can silently exfiltrate sensitive data without any user interaction.

Read another verdict

Get Counsel for your own decisions →