TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology
Summary
TherapeuticsBench Preclinical Pharmacology (TxBench-PP) is a verifiable benchmark for evaluating AI agents on small-molecule preclinical pharmacology, and the first segment of the broader TherapeuticsBench initiative. It comprises 100 evaluations, categorized by program stage, assay type, and task structure, and covers mechanism of action, pharmacodynamic reasoning, compound-target engagement, causal target validation, developability, safety, and translational efficacy. Agents work from realistic workflow snapshots, inspect files in a coding environment, and return structured answers that are graded deterministically. Across 16 model-harness configurations, spanning 11 models and 4,800 trajectories, no system made preclinical pharmacology decisions reliably. The top performer, Claude Opus 4.8 / Pi, passed 59.3% of endpoint attempts (178/300), followed by GPT-5.5 / Pi at 55.3% (166/300).
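The reported figures can be checked with a few lines of arithmetic. The sketch below recomputes the two pass rates from the counts quoted above. It also tests one reading of the totals that the summary does not state: 300 endpoint attempts over 100 evaluations would mean three attempts per evaluation, and that count times 16 configurations should then give the 4,800 trajectories. The three-attempt figure is an inference, not a reported number.

```python
# Figures quoted in the summary above.
PASSES = {"Claude Opus 4.8 / Pi": 178, "GPT-5.5 / Pi": 166}
ATTEMPTS_PER_CONFIG = 300
EVALUATIONS = 100
CONFIGURATIONS = 16
TRAJECTORIES = 4800


def pass_rate(passed: int, attempts: int) -> float:
    """Percentage of attempts passed, rounded to one decimal place."""
    return round(100 * passed / attempts, 1)


rates = {name: pass_rate(n, ATTEMPTS_PER_CONFIG) for name, n in PASSES.items()}

# Assumption (inferred from the totals, not stated in the summary):
# each evaluation is attempted the same number of times per configuration.
attempts_per_eval = ATTEMPTS_PER_CONFIG // EVALUATIONS
implied_trajectories = CONFIGURATIONS * EVALUATIONS * attempts_per_eval

for name, rate in rates.items():
    print(f"{name}: {rate}%")
print(f"attempts per evaluation (inferred): {attempts_per_eval}")
print(f"implied trajectories: {implied_trajectories}")
```

Under that reading the implied total matches the 4,800 trajectories in the summary, so the quoted numbers are at least internally consistent.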
Key takeaway
For AI scientists developing or deploying agents in drug discovery, these findings show a substantial performance gap in preclinical pharmacology: even the best configuration passed only 59.3% of endpoint attempts. Evaluate agents against realistic, data-driven benchmarks such as TxBench-PP before trusting them with critical program decisions, and focus development on interpreting novel assay data rather than recalling memorized knowledge.
Key insights
AI agents cannot yet reliably draw accurate preclinical pharmacology decisions from real-world assay data.
Principles
- Trustworthy evaluation of AI agents requires tasks built around realistic program decisions.
- Evaluation should focus on recovering conclusions from real-world data, not memorized facts.
Method
In TxBench-PP, agents inspect files in a coding environment and return structured answers, which are graded deterministically.
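To make "structured answers for deterministic grading" concrete, here is a minimal sketch of what such a grader could look like. It is not the benchmark's actual grader. The JSON answer format, the field names (`most_potent_compound`, `ic50_nM`), the example values, and the 5% numeric tolerance are all hypothetical, chosen only to illustrate that the same answer always gets the same verdict.

```python
import json
import math

# Hypothetical answer key for one task; field names and values are made up.
EXPECTED = {"most_potent_compound": "CPD-7", "ic50_nM": 42.0}


def grade(answer_json: str, expected: dict, rel_tol: float = 0.05) -> bool:
    """Return True only if every expected field is present and matches.

    Numeric fields match within a relative tolerance; strings match
    after trimming whitespace and ignoring case. No randomness or model
    judgment is involved, so grading is repeatable.
    """
    try:
        answer = json.loads(answer_json)
    except json.JSONDecodeError:
        return False  # malformed output counts as a failure
    if not isinstance(answer, dict):
        return False

    for key, want in expected.items():
        got = answer.get(key)
        is_number = isinstance(want, (int, float)) and not isinstance(want, bool)
        if is_number:
            if isinstance(got, bool) or not isinstance(got, (int, float)):
                return False
            if not math.isclose(got, want, rel_tol=rel_tol):
                return False
        elif isinstance(want, str):
            if not isinstance(got, str) or got.strip().lower() != want.strip().lower():
                return False
        elif got != want:
            return False
    return True


print(grade('{"most_potent_compound": "cpd-7", "ic50_nM": 43.0}', EXPECTED))
print(grade('{"most_potent_compound": "CPD-3", "ic50_nM": 42.0}', EXPECTED))
```

Restricting the answer to a fixed set of typed fields is what makes this kind of grading possible: free-text conclusions would need a human or model judge, while a structured answer can be compared field by field.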
In practice
- The benchmark evaluates AI agents on mechanism-of-action tasks.
- It also covers pharmacodynamic reasoning, developability, and safety.
Topics
- AI Agents
- Drug Discovery
- Preclinical Pharmacology
- Benchmarking
- Small Molecules
- Assay Data
Best for: AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.