Jun 8, 2026 · Science
Paving the way for agents in biology
Summary
A recent analysis highlights the challenges AI agents face when working with existing biological data infrastructure, which is often heterogeneous and lacks standardized programmatic access. Researchers tasked scientific agents, including Claude, Biomni Open Source, Edison Analysis, and GPT models, with retrieving viral sequence data from NCBI Virus. Initial performance was inconsistent: mean accuracies ranged from 16.9% to 91.3% with substantial run-to-run variability, and the resulting retrieval errors led to critical errors in downstream analyses such as phylogenetic tree construction and therapeutic epitope examination. To address this, the team developed "gget virus", a deterministic retrieval layer that coordinates across multiple NCBI APIs. Integrating "gget virus" raised agent accuracy to over 90% for all tested models, reaching 99.7% for GPT-5.5, and eliminated run-to-run variability. The result demonstrates the need for reliable, agent-accessible infrastructure for biological data.
Key takeaway
AI Engineers and Research Scientists building agents for biological data analysis should recognize that an agent's reliability hinges on the underlying data infrastructure. Do not depend solely on advanced LLMs to navigate complex, heterogeneous biological databases. Instead, integrate deterministic retrieval layers, like "gget virus", into your workflows. This approach boosts accuracy and reproducibility, makes scientific agents more trustworthy for critical tasks such as outbreak response or drug design, and reduces dependence on expensive frontier models.
Key insights
Reliable AI agent performance in biology requires deterministic data retrieval layers, not just advanced reasoning.
Principles
- Biological data infrastructure hinders AI agents.
- Deterministic retrieval layers ensure agent reliability.
- Small errors in biology workflows have severe consequences.
Method
"gget virus" coordinates NCBI APIs (REST, Datasets, E-utilities) to replicate web interface filtering, manage batching, and standardize outputs for accurate viral sequence retrieval.
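The batching and output-standardization steps described above can be illustrated with a minimal sketch. This is not the "gget virus" implementation or its API: the function names, the batch size, and the field names (`accession`, `host`, `length`) are hypothetical, chosen only to show how a retrieval layer can make the same request produce the same batches and the same record schema on every run.

```python
from typing import Iterable, Iterator


def batch_accessions(accessions: Iterable[str], batch_size: int = 200) -> Iterator[list[str]]:
    """Yield fixed-size batches of accessions in a reproducible order.

    De-duplicating and sorting first means the batches do not depend on
    the order in which an agent happened to collect the identifiers.
    The default batch size is an arbitrary illustrative value.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    unique_sorted = sorted({a.strip() for a in accessions if a.strip()})
    for start in range(0, len(unique_sorted), batch_size):
        yield unique_sorted[start:start + batch_size]


def standardize_record(raw: dict) -> dict:
    """Map a raw record onto one fixed schema (hypothetical field names).

    Different endpoints may capitalize keys differently; this accepts
    either spelling and always returns the same three typed fields.
    """
    return {
        "accession": str(raw.get("accession") or raw.get("Accession") or "").strip(),
        "host": str(raw.get("host") or raw.get("Host") or "").strip().lower(),
        "length": int(raw.get("length") or raw.get("Length") or 0),
    }
```

For example, `list(batch_accessions(["B2", "A1", "A1", " C3 "], batch_size=2))` yields the same two batches regardless of input order or duplicates, which is the property an agent cannot guarantee when it improvises its own API calls.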
In practice
- Implement deterministic data retrieval tools.
- Design new databases for agent-friendly access.
- Prioritize auditability in agent-driven workflows.
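One way to approach the auditability point is to log a fingerprint of each query and of the records it returned, so that two runs can be compared directly. The sketch below is an assumption about what such a log entry could look like, not something described in the original article; `audit_record` and its field names are hypothetical.

```python
import hashlib
import json


def audit_record(query: dict, accessions: list[str]) -> dict:
    """Build a reproducible audit entry for one retrieval.

    The query is serialized with sorted keys and the accession list is
    sorted before hashing, so identical requests and identical result
    sets always produce identical fingerprints.
    """
    canonical_query = json.dumps(query, sort_keys=True, separators=(",", ":"))
    canonical_result = "\n".join(sorted(accessions))
    return {
        "query": canonical_query,
        "query_sha256": hashlib.sha256(canonical_query.encode("utf-8")).hexdigest(),
        "result_sha256": hashlib.sha256(canonical_result.encode("utf-8")).hexdigest(),
        "n_records": len(accessions),
    }
```

If two agent runs with the same query produce different `result_sha256` values, the retrieval step itself was not reproducible, and downstream analyses can be flagged before anyone interprets them.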
Topics
- AI Agents
- Computational Biology
- Data Infrastructure
- NCBI Virus
- gget virus
- Viral Genomics
- LLM Reliability
Best for: AI Architect, AI Scientist, Research Scientist, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Anthropic Research.