TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology

2026-06-17 · Source: Artificial Intelligence · Field: Science & Research — Health & Medical Research, Life Sciences & Biology, Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

TherapeuticsBench Preclinical Pharmacology (TxBench-PP) is a verifiable benchmark for evaluating AI agents on small-molecule preclinical pharmacology, and the first component of a broader TherapeuticsBench initiative. It comprises 100 evaluations, categorized by program stage, assay type, and task structure, and covers mechanism-of-action, pharmacodynamic reasoning, compound-target engagement, causal target validation, developability, safety, and translational efficacy. Agents work from realistic workflow snapshots, inspect files in a coding environment, and return structured answers that are graded deterministically. Across 16 model-harness configurations, spanning 11 models and 4,800 trajectories, no system made preclinical pharmacology decisions reliably. The top performer, Claude Opus 4.8 / Pi, passed 59.3% of endpoint attempts (178/300), followed by GPT-5.5 / Pi at 55.3% (166/300).
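The figures in the summary can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted above; the one inferred value is three attempts per evaluation, which the summary does not state directly but which follows from 300 attempts per configuration over 100 evaluations.

```python
# Figures quoted in the summary.
NUM_EVALS = 100
NUM_CONFIGS = 16
TOTAL_TRAJECTORIES = 4800

# Attempts each configuration made across the whole benchmark.
attempts_per_config = TOTAL_TRAJECTORIES // NUM_CONFIGS

# Inferred, not stated in the summary: attempts per individual evaluation.
attempts_per_eval = attempts_per_config // NUM_EVALS


def pass_rate(passed: int, attempts: int) -> float:
    """Return the pass rate as a percentage rounded to one decimal place."""
    return round(100 * passed / attempts, 1)


rates = {
    "Claude Opus 4.8 / Pi": pass_rate(178, attempts_per_config),
    "GPT-5.5 / Pi": pass_rate(166, attempts_per_config),
}

for name, rate in rates.items():
    print(f"{name}: {rate}% of {attempts_per_config} endpoint attempts")
```

Both reported percentages match their raw counts, and the trajectory total is consistent with every configuration running the full benchmark the same number of times.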

Key takeaway

For AI scientists developing or deploying agents in drug discovery, these findings expose a significant performance gap in preclinical pharmacology: even the best configuration failed roughly four in ten endpoint attempts. Evaluate agents against realistic, data-driven benchmarks like TxBench-PP before trusting them with critical program decisions, and focus development on improving their ability to interpret novel assay data rather than relying on memorized knowledge.

Key insights

AI agents currently struggle to reliably make accurate preclinical pharmacology decisions from real-world assay data.

Principles

Trustworthy evaluation of AI agents requires tasks built around realistic program decisions.
Evaluation should focus on recovering conclusions from real-world data, not memorized facts.

Method

The TxBench-PP benchmark involves agents inspecting files in a coding environment and returning structured answers for deterministic grading.
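As a rough illustration of what deterministic grading of a structured answer can look like, here is a minimal sketch. It is not the benchmark's actual grader: the field names (`mechanism`, `ic50_nm`, `advance_compound`), the 5% numeric tolerance, and the text normalization are all assumptions made for the example, since the summary does not describe the answer schema or grading rules.

```python
import json
import math


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting does not matter."""
    return " ".join(text.lower().split())


def grade(answer_json: str, expected: dict, rel_tol: float = 0.05) -> bool:
    """Return True only if every expected field matches the agent's answer.

    Hypothetical rules: strings match after normalization, numbers match
    within a relative tolerance, and everything else must be exactly equal.
    """
    try:
        answer = json.loads(answer_json)
    except json.JSONDecodeError:
        return False  # malformed output is a failed attempt
    if not isinstance(answer, dict):
        return False

    for field, want in expected.items():
        got = answer.get(field)
        if isinstance(want, str):
            if not isinstance(got, str) or normalize(got) != normalize(want):
                return False
        elif isinstance(want, (int, float)) and not isinstance(want, bool):
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(got, bool) or not isinstance(got, (int, float)):
                return False
            if not math.isclose(got, want, rel_tol=rel_tol):
                return False
        elif got != want:
            return False
    return True


# Hypothetical answer key and agent outputs, for illustration only.
EXPECTED = {"mechanism": "competitive inhibitor", "ic50_nm": 12.0, "advance_compound": False}
good = '{"mechanism": "Competitive  Inhibitor", "ic50_nm": 12.3, "advance_compound": false}'
bad = '{"mechanism": "competitive inhibitor", "ic50_nm": 20.0, "advance_compound": false}'

print(grade(good, EXPECTED), grade(bad, EXPECTED), grade("not json", EXPECTED))
```

Because the grader is a pure function of the answer and the key, the same trajectory always receives the same verdict, which is what makes pass rates comparable across models and harnesses.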

In practice

The benchmark evaluates AI agents on mechanism-of-action tasks.
It also covers pharmacodynamic reasoning, compound-target engagement, causal target validation, developability, safety, and translational efficacy.

Topics

AI Agents
Drug Discovery
Preclinical Pharmacology
Benchmarking
Small Molecules
Assay Data

Best for: AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.