EpiBench: Verifiable Evaluation of AI Agents on Epigenomics Analysis
Summary
EpiBench is a new verifiable benchmark designed to evaluate AI agents' performance in short-horizon epigenomics analysis. It comprises 106 evaluations across CUT&Tag/CUT&RUN, ATAC-seq, ChIP-seq, and DNA methylation workflows, requiring agents to make well-defined analysis decisions from realistic workflow states and return deterministically gradable answers. Across 5,088 valid trajectories from 16 model-harness pairs, no system achieved a majority pass rate. The top performer, GPT-5.5 / Pi, passed 45.0% (143/318 attempts), followed by GPT-5.5 / OpenAI Codex at 39.9% (127/318 attempts), and Claude Opus 4.8 Max / Pi and GPT-5.4 / Pi both at 39.0% (124/318 attempts). Failures often stemmed from agents struggling with assay-specific scientific judgment, despite finding correct files or computing intermediate results. Performance varied by assay type, with CUT&Tag/CUT&RUN showing the highest aggregate pass rate at 34.0%.
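The headline figures can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted in the summary. The three attempts per evaluation is an inference from 5,088 / 16 / 106, not a number the summary states, and it assumes valid trajectories are spread evenly across pairs.

```python
# Figures quoted in the summary above.
N_EVALUATIONS = 106
N_PAIRS = 16
N_TRAJECTORIES = 5088

# Attempts per model-harness pair (matches the 318 denominators quoted above).
attempts_per_pair = N_TRAJECTORIES // N_PAIRS

# Inferred, not stated in the summary: attempts per evaluation per pair.
attempts_per_eval = attempts_per_pair // N_EVALUATIONS

# Pass counts quoted in the summary.
passes = {
    "GPT-5.5 / Pi": 143,
    "GPT-5.5 / OpenAI Codex": 127,
    "Claude Opus 4.8 Max / Pi": 124,
    "GPT-5.4 / Pi": 124,
}


def pass_rate(n_pass: int, n_attempts: int) -> float:
    """Pass rate as a percentage, rounded to one decimal place."""
    return round(100 * n_pass / n_attempts, 1)


rates = {name: pass_rate(n, attempts_per_pair) for name, n in passes.items()}

for name, rate in rates.items():
    print(f"{name}: {rate}%")
```

The recomputed rates agree with the percentages reported in the summary.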
Key takeaway
AI scientists and machine learning engineers developing agents for biological data analysis should prioritize grounding models in assay-specific evidence. Current agents often fail on nuanced scientific judgment even when they execute tools correctly. Focus on improving agents' ability to interpret specific data artifacts rather than relying on generic workflow defaults or literature priors, which should improve reliability and reduce errors in epigenomics analysis tasks.
Key insights
AI agents struggle with scientific judgment in epigenomics analysis, often failing on assay-specific decisions despite partial progress.
Principles
- Verifiable benchmarks require deterministic grading.
- Scientifically durable tasks test reasoning, not implementation details.
- Tasks must resist trivial shortcuts or prior knowledge.
Method
EpiBench constructs evaluations from real epigenomics workflows, isolating gradeable decisions with snapshot workflow states, metadata, and deterministic graders.
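To make "deterministic grader" concrete, here is a minimal sketch of what such a grader could look like. It is not EpiBench's implementation: the class names, the tolerance-based numeric check, and the set-matching check are all hypothetical, chosen only to show that the same answer always receives the same verdict with no model or human in the loop.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class NumericGrader:
    """Hypothetical grader: passes if the answer is within a relative tolerance."""
    expected: float
    rel_tol: float

    def grade(self, answer) -> bool:
        try:
            value = float(answer)
        except (TypeError, ValueError):
            return False  # unparseable answers fail, they never raise
        return math.isclose(value, self.expected, rel_tol=self.rel_tol)


@dataclass(frozen=True)
class SetGrader:
    """Hypothetical grader: passes if the submitted labels match the expected set."""
    expected: frozenset

    def grade(self, answer) -> bool:
        try:
            # Normalize case and whitespace so formatting does not affect the verdict.
            submitted = {str(item).strip().upper() for item in answer}
        except TypeError:
            return False  # non-iterable answers fail
        return submitted == self.expected


# Illustrative values only; these are not taken from the benchmark.
fraction_grader = NumericGrader(expected=0.25, rel_tol=0.05)
label_grader = SetGrader(expected=frozenset({"GATA1", "TAL1"}))

print(fraction_grader.grade("0.26"))            # within tolerance
print(label_grader.grade(["tal1", " GATA1 "]))  # same set after normalization
```

Because both graders are pure functions of the submitted answer, rerunning them over stored trajectories reproduces the same pass/fail results.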
In practice
- Review agent trajectories for specific scientific judgment failures.
- Focus on grounding biological claims in assay artifacts.
Topics
- Epigenomics Analysis
- AI Agents
- Benchmark Evaluation
- CUT&Tag/CUT&RUN
- ATAC-seq
- DNA Methylation
Best for: AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.