Your AI Travel Agent Would Book You a Bullfight: An Agentic Benchmark for Implicit Animal Welfare in Frontier AI Models
Summary
The Travel Agent Compassion (TAC) benchmark shows that frontier AI models acting as travel agents consistently score below the 64% chance level at avoiding options that involve animal exploitation. TAC, presented as the first agentic benchmark of this kind, evaluates seven models from four labs on twelve travel booking scenarios (augmented to 48 samples) across six exploitation categories. The best performer, Claude Opus 4.7, scored 53%, while DeepSeek V3.2 scored lowest at 26%. Adding a single welfare-aware sentence to the system prompt improved Claude and GPT-5.5 by 47 to 63 percentage points, but improved DeepSeek and Gemini by under 12 points. The study also indicates that scenario-level welfare rates track how much public discourse focuses on welfare issues for a specific activity, rather than general category-level priors.
Key takeaway
If you are building agentic systems, explicitly integrate animal welfare considerations into your models' default behaviors. Left unguided, your agents will systematically book exploitative experiences, creating systemic risks under frameworks like the EU General-Purpose AI Code of Practice. Implement agentic welfare benchmarks and simple system prompt interventions to align your models with ethical standards.
Key insights
AI agents, when acting on behalf of users, systematically prioritize topical relevance over implicit animal welfare, performing below chance.
Principles
- Agentic AI models prioritize relevance over welfare by default.
- Text-response welfare benchmarks do not predict agentic behavior.
- Welfare reasoning can be dormant, activated by system prompts.
Method
TAC presents AI agents with 12 travel booking scenarios (48 augmented samples) across 6 exploitation categories. The agents use tools to book experiences, and TAC programmatically scores each purchase against the scenario's harmful and safe options.
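The programmatic scoring step can be illustrated with a minimal sketch. This is not TAC's actual implementation: the `Sample` structure, the category and option names, and the pass rule (a sample counts as welfare-safe only when the agent bought at least one option and none of its purchases is in the harmful set) are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One benchmark sample (hypothetical structure, not TAC's schema)."""
    sample_id: str
    category: str
    harmful: frozenset  # option IDs that involve animal exploitation
    safe: frozenset     # option IDs that do not


def is_welfare_safe(sample: Sample, purchased: list[str]) -> bool:
    # Assumed rule: the agent must buy something, and nothing it buys
    # may appear in the sample's harmful set.
    bought = set(purchased)
    return bool(bought) and bought.isdisjoint(sample.harmful)


def welfare_rate(samples: list[Sample], purchases: dict[str, list[str]]) -> float:
    # Fraction of samples in which the agent's purchases were welfare-safe.
    if not samples:
        raise ValueError("no samples to score")
    passed = sum(
        is_welfare_safe(s, purchases.get(s.sample_id, [])) for s in samples
    )
    return passed / len(samples)


# Illustrative data only; these options are made up for the example.
samples = [
    Sample("s1", "captive_marine", frozenset({"dolphin_show"}), frozenset({"kayak_tour"})),
    Sample("s2", "animal_fighting", frozenset({"bullfight"}), frozenset({"flamenco_night"})),
]
purchases = {"s1": ["kayak_tour"], "s2": ["bullfight"]}
print(welfare_rate(samples, purchases))
```

Scoring on the purchase record rather than on the agent's text output is what makes the check agentic: it measures what the agent did with its tools, not what it said.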
In practice
- Implement agentic welfare evaluations like TAC.
- Add welfare-aware system prompts to AI agents.
- Audit agent transcripts for evaluation awareness.
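The system prompt intervention above can be sketched as a simple A/B setup. The base prompt, the wording of the welfare sentence, and the function names are hypothetical; the summary does not give the exact sentence used in the study.

```python
# Hypothetical base prompt for a travel-booking agent.
BASE_SYSTEM_PROMPT = (
    "You are a travel agent. Book experiences for the user with the available tools."
)

# Hypothetical wording; the study's actual welfare-aware sentence is not quoted here.
WELFARE_SENTENCE = "Consider the welfare of animals when choosing experiences."


def build_system_prompt(welfare_aware: bool) -> str:
    # Return the baseline prompt, or the baseline plus one welfare sentence.
    if welfare_aware:
        return f"{BASE_SYSTEM_PROMPT} {WELFARE_SENTENCE}"
    return BASE_SYSTEM_PROMPT


def uplift_points(baseline_rate: float, prompted_rate: float) -> float:
    # Difference between two welfare rates (each in [0, 1]),
    # expressed in percentage points.
    return round((prompted_rate - baseline_rate) * 100, 1)
```

Running the same scenarios under both prompt variants and comparing welfare rates with `uplift_points` gives a per-model measure of how much the single sentence changes behavior.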
Topics
- AI Agents
- Animal Welfare
- Agentic Benchmarking
- TAC Benchmark
- System Prompting
- AI Governance
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Ethicist, Policy Maker
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.