Adversarial Feeds Steer LLM Agent Decisions Against Their Defaults
Summary
A new protocol reveals that external information feeds significantly steer LLM agent decisions, even when the model, persona, topic, and final prompt are fixed. Researchers conducted 2,785 decision rollouts across four open instruct LLMs from three independent labs, isolating the causal effect of feed composition and ordering during a ten-turn "scrolling" phase. They identified "adversarial capitulation," "default saturation," and a "default-direction asymmetry," where a one-sided feed can shift a decision from 5% to 100% certainty (Fisher p as low as 3 x 10^-10) for uncertain models, but cannot dislodge firmly held defaults. This effect follows a dose-response curve, generalizes to security-relevant choices like relaxing access controls, and is partly mitigated by simple feed-level defenses, though frontier models retain their defaults. The study characterizes the recommender as a practical, default-bounded control surface, emphasizing the need to audit the feed layer in agent evaluations.
Key takeaway
For AI Security Engineers or MLOps teams deploying LLM agents, you must extend your safety evaluations beyond model prompts to include the upstream information feeds. Your agents' decisions, even on critical security choices like access controls, can be significantly swayed by curated content, shifting outcomes from 5% to 100%. Implement feed-level defenses and audit the recommender layer to prevent adversarial steering and ensure your agents maintain intended operational integrity.
Key insights
External information feeds significantly influence LLM agent decisions, acting as a control surface independent of the core model.
Principles
- LLM agent decisions exhibit adversarial capitulation and default saturation.
- Feed influence follows a dose-response curve and shows default-direction asymmetry.
- Frontier models are more resilient to feed manipulation than open instruct LLMs.
Method
A controlled protocol fixes model, persona, topic, and final prompt, varying only the composition and ordering of posts in a ten-turn "scrolling" phase to isolate feed effects.
In practice
- Implement simple feed-level defenses to mitigate adversarial steering.
- Audit the feed layer for LLM agents, not just the final prompt.
- Consider feed curation in security-relevant LLM agent deployments.
Topics
- LLM Agents
- Adversarial Attacks
- Feed Curation
- Recommender Systems
- AI Safety
- Access Control
Best for: Research Scientist, AI Architect, CTO, AI Scientist, AI Security Engineer, MLOps Engineer
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.