Your AI Travel Agent Would Book You a Bullfight: An Agentic Benchmark for Implicit Animal Welfare in Frontier AI Models

2026-06-18 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

The Travel Agent Compassion (TAC) benchmark reveals that frontier AI models, when acting as travel agents, consistently perform below a 64% chance level in avoiding options involving animal exploitation. This first agentic benchmark, evaluating seven models from four labs across twelve travel booking scenarios (augmented to 48 samples) in six exploitation categories, found the best performer, Claude Opus 4.7, at 53%, while DeepSeek V3.2 scored lowest at 26%. A single welfare-aware sentence in the system prompt significantly improved performance in Claude and GPT-5.5 by 47 to 63 percentage points, but under 12 points for DeepSeek and Gemini. The study also indicates that scenario-level welfare rates correlate with the public discourse's focus on welfare issues for specific activities, rather than general category-level priors.

Key takeaway

For AI developers building agentic systems, you must explicitly integrate animal welfare considerations into your models' default behaviors. Your agents, if left unguided, will systematically book exploitative experiences, creating systemic risks under frameworks like the EU General-Purpose AI Code of Practice. Implement agentic welfare benchmarks and simple system prompt interventions to align your models with ethical standards.

Key insights

AI agents, when acting on behalf of users, systematically prioritize topical relevance over implicit animal welfare, performing below chance.

Principles

Agentic AI models prioritize relevance over welfare by default.
Text-response welfare benchmarks do not predict agentic behavior.
Welfare reasoning can be dormant, activated by system prompts.

Method

TAC presents AI agents with 12 travel booking scenarios (48 augmented samples) across 6 exploitation categories, using tools to book experiences. It programmatically scores purchases against harmful/safe options.

In practice

Implement agentic welfare evaluations like TAC.
Add welfare-aware system prompts to AI agents.
Audit agent transcripts for evaluation awareness.

Topics

AI Agents
Animal Welfare
Agentic Benchmarking
TAC Benchmark
System Prompting
AI Governance

Code references

UKGovernmentBEIS/inspect_evals

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Ethicist, Policy Maker

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.