EpiBench: Verifiable Evaluation of AI Agents on Epigenomics Analysis

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Science & Research — Life Sciences & Biology, Research Methodology & Innovation, AI in Life Sciences · Depth: Expert, long

Summary

EpiBench is a new verifiable benchmark for evaluating AI agents on short-horizon epigenomics analysis. It comprises 106 evaluations across CUT&Tag/CUT&RUN, ATAC-seq, ChIP-seq, and DNA methylation workflows. Each evaluation requires the agent to make a well-defined analysis decision from a realistic workflow state and return a deterministically gradable answer. Across 5,088 valid trajectories from 16 model-harness pairs, no system passed a majority of its attempts. The top performer, GPT-5.5 / Pi, passed 45.0% (143/318 attempts), followed by GPT-5.5 / OpenAI Codex at 39.9% (127/318 attempts), with Claude Opus 4.8 Max / Pi and GPT-5.4 / Pi tied at 39.0% (124/318 attempts each). Agents often found the correct files and computed intermediate results, then failed on assay-specific scientific judgment. Performance varied by assay type, with CUT&Tag/CUT&RUN showing the highest aggregate pass rate at 34.0%.
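
The headline figures can be cross-checked with a few lines of arithmetic. The Python sketch below recomputes the reported pass rates from the pass and attempt counts given in the summary. Note that 318 attempts per pair over 106 evaluations works out to three attempts per evaluation; that figure is inferred from the numbers here, not stated in the summary, and all variable names are illustrative.

```python
# Counts taken from the summary above.
EVALUATIONS = 106
MODEL_HARNESS_PAIRS = 16
ATTEMPTS_PER_PAIR = 318
TOTAL_TRAJECTORIES = 5088

# (passed, attempts) for the systems named in the summary.
reported = {
    "GPT-5.5 / Pi": (143, 318),
    "GPT-5.5 / OpenAI Codex": (127, 318),
    "Claude Opus 4.8 Max / Pi": (124, 318),
    "GPT-5.4 / Pi": (124, 318),
}


def pass_rate(passed: int, attempts: int) -> float:
    """Pass rate as a percentage, rounded to one decimal place."""
    return round(100 * passed / attempts, 1)


rates = {name: pass_rate(p, n) for name, (p, n) in reported.items()}

# Inferred, not stated in the summary: attempts per evaluation.
attempts_per_evaluation = ATTEMPTS_PER_PAIR // EVALUATIONS

for name, rate in rates.items():
    print(f"{name}: {rate}%")
print("attempts per evaluation (inferred):", attempts_per_evaluation)
print("total trajectories:", MODEL_HARNESS_PAIRS * ATTEMPTS_PER_PAIR)
```

Even the best system sits below 50%, which is what "no majority pass rate" means in practice.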

Key takeaway

AI Scientists and Machine Learning Engineers building agents for biological data analysis should prioritize grounding models in assay-specific evidence. Current agents often fail on nuanced scientific judgment even when they execute tools correctly. Focus on improving agents' ability to interpret the specific data artifacts in front of them rather than falling back on generic workflow defaults or literature priors. Doing so should improve reliability and reduce errors in critical epigenomics analysis tasks.

Key insights

AI agents struggle with scientific judgment in epigenomics analysis, often failing on assay-specific decisions despite partial progress.

Principles

Verifiable benchmarks require deterministic grading.
Scientifically durable tasks test reasoning, not implementation details.
Tasks must resist trivial shortcuts or prior knowledge.

Method

EpiBench constructs evaluations from real epigenomics workflows, isolating gradeable decisions with snapshot workflow states, metadata, and deterministic graders.
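
To make "deterministic grader" concrete, here is a minimal Python sketch of one possible shape for such a grader. It is not EpiBench's actual schema or code: the Evaluation fields, the grade function, the tolerance rule, and the example values are all hypothetical, chosen only to show how a fixed expected answer can be checked with no model or human in the loop.

```python
import math
from dataclasses import dataclass
from typing import Union

Answer = Union[str, int, float]


@dataclass(frozen=True)
class Evaluation:
    """Hypothetical evaluation record; field names are assumptions."""
    eval_id: str
    assay: str
    expected: Answer
    rel_tol: float = 0.0  # relative tolerance for numeric answers


def grade(evaluation: Evaluation, answer: object) -> bool:
    """Return True iff the answer matches; same inputs always give the same result."""
    expected = evaluation.expected
    if isinstance(expected, str):
        # Categorical answers: compare after trimming whitespace and case.
        return (
            isinstance(answer, str)
            and answer.strip().lower() == expected.strip().lower()
        )
    # Numeric answers: reject non-numbers (bool is excluded on purpose).
    if isinstance(answer, bool) or not isinstance(answer, (int, float)):
        return False
    return math.isclose(answer, expected, rel_tol=evaluation.rel_tol)


# Illustrative values only, not taken from the benchmark.
numeric_eval = Evaluation("demo-1", "ATAC-seq", expected=0.30, rel_tol=0.05)
choice_eval = Evaluation("demo-2", "ChIP-seq", expected="exclude")

print(grade(numeric_eval, 0.31))        # within 5% relative tolerance
print(grade(numeric_eval, 0.40))        # outside tolerance
print(grade(choice_eval, " Exclude "))  # normalized string match
```

Because the grader is a pure function of the evaluation record and the answer, repeated runs over the same trajectory cannot disagree, which is the property that makes a benchmark verifiable.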

In practice

Review agent trajectories for specific scientific judgment failures.
Focus on grounding biological claims in assay artifacts.
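
One way to start such a review is to group trajectories by assay and pull out the failures for manual inspection. The Python sketch below assumes a hypothetical trajectory record with "id", "assay", and "passed" keys; the toy records are made up for illustration and are not benchmark results.

```python
from collections import defaultdict


def pass_rate_by_assay(trajectories):
    """Map each assay to its fraction of passing trajectories."""
    counts = defaultdict(lambda: [0, 0])  # assay -> [passed, attempts]
    for t in trajectories:
        counts[t["assay"]][0] += int(t["passed"])
        counts[t["assay"]][1] += 1
    return {assay: p / n for assay, (p, n) in counts.items()}


def failures_by_assay(trajectories):
    """Map each assay to the ids of its failed trajectories, in input order."""
    failed = defaultdict(list)
    for t in trajectories:
        if not t["passed"]:
            failed[t["assay"]].append(t["id"])
    return dict(failed)


# Toy records (hypothetical format and values).
toy = [
    {"id": "t1", "assay": "ATAC-seq", "passed": True},
    {"id": "t2", "assay": "ATAC-seq", "passed": False},
    {"id": "t3", "assay": "ChIP-seq", "passed": False},
    {"id": "t4", "assay": "DNA methylation", "passed": True},
]

assay_rates = pass_rate_by_assay(toy)
to_review = failures_by_assay(toy)
print(assay_rates)
print(to_review)
```

The failed ids are the trajectories worth reading by hand to see whether the agent's biological claim was grounded in the assay artifacts or in a generic default.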

Topics

Epigenomics Analysis
AI Agents
Benchmark Evaluation
CUT&Tag/CUT&RUN
ATAC-seq
DNA Methylation

Code references

latchbio/epibench

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.