SciHorizon-GENE: Benchmarking LLM for Life Sciences Inference from Gene Knowledge to Functional Understanding
Summary
SciHorizon-Gene is a large-scale benchmark that evaluates the ability of large language models (LLMs) to reason from gene-level knowledge to functional understanding in the life sciences. Constructed from authoritative biological databases, it integrates curated knowledge for over 190,000 human genes and comprises more than 540,000 questions. The benchmark assesses LLMs from four perspectives, each targeting a common failure mode: research attention sensitivity, hallucination tendency, answer completeness, and literature influence. A systematic evaluation of 27 general-purpose and biomedical LLMs revealed substantial heterogeneity in gene-level reasoning capabilities. The study found persistent difficulty in generating faithful, complete, and literature-grounded functional interpretations, and domain-specialized models did not consistently outperform general-purpose LLMs.
Key takeaway
Research scientists and machine learning engineers adopting LLMs for biomedical interpretation should recognize that current models exhibit systematic reliability gaps in gene-level functional understanding. Do not assume that strong general biomedical QA performance translates to faithful gene-level interpretation. Instead, use a multi-dimensional evaluation framework such as SciHorizon-Gene to assess specific failure modes, such as hallucination tendency for low-attention genes or recall limitations in multi-answer queries, before deploying a model in critical biological analysis pipelines.
Key insights
LLMs exhibit systematic failures in gene-to-function reasoning, struggling with completeness, hallucination, and context integration despite general biomedical promise.
Principles
- LLM gene understanding is uneven across research attention levels.
- Hallucination resistance varies by gene attribute type.
- Completeness in multi-answer tasks is limited by recall.
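The recall limitation in the last principle can be made concrete with a set-based score. The following is a minimal sketch, not the benchmark's actual metric: the function name `set_precision_recall`, the string normalization, and the example annotation labels are illustrative assumptions.

```python
def set_precision_recall(predicted, reference):
    """Compare a model's answer set with the reference set for one multi-answer question.

    Hypothetical helper: exact string matching after normalization is an
    assumption, not the scoring rule used by SciHorizon-Gene.
    """
    # Normalize casing and whitespace so trivially different strings still match.
    pred = {p.strip().lower() for p in predicted if p.strip()}
    ref = {r.strip().lower() for r in reference if r.strip()}
    hits = pred & ref
    precision = len(hits) / len(pred) if pred else 0.0
    recall = len(hits) / len(ref) if ref else 0.0
    return precision, recall


# Illustrative data only: the model names two of four reference annotations.
reference = ["DNA repair", "apoptotic process", "cell cycle arrest", "protein binding"]
predicted = ["DNA repair", "Protein binding"]
precision, recall = set_precision_recall(predicted, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=1.00 recall=0.50
```

In this example every answer the model gives is correct, yet half of the reference answers are missing, which is the pattern a precision-only or single-answer evaluation would not reveal.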
Method
SciHorizon-Gene constructs a gene-centric benchmark from NCBI Gene, Gene Ontology (GO), and PubMed data, generating more than 540,000 questions. It evaluates LLMs from four behavioral perspectives using automated metrics.
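As a rough sketch of how per-question results might be rolled up into the four perspectives, consider the code below. The record layout, the field names, the score range, and the use of a simple mean are all assumptions for illustration; the summary does not describe the benchmark's schema or its metrics.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

# The four perspectives named in the summary, written as hypothetical keys.
PERSPECTIVES = (
    "research_attention_sensitivity",
    "hallucination_tendency",
    "answer_completeness",
    "literature_influence",
)


@dataclass(frozen=True)
class Result:
    gene_id: str      # hypothetical gene identifier
    perspective: str  # one of PERSPECTIVES
    score: float      # assumed to lie in [0, 1], higher is better


def summarize(results):
    """Average scores per perspective, skipping perspectives with no results."""
    by_perspective = defaultdict(list)
    for r in results:
        if r.perspective not in PERSPECTIVES:
            raise ValueError(f"unknown perspective: {r.perspective}")
        by_perspective[r.perspective].append(r.score)
    return {p: mean(by_perspective[p]) for p in PERSPECTIVES if by_perspective[p]}


# Illustrative data only.
results = [
    Result("GENE_A", "hallucination_tendency", 1.0),
    Result("GENE_B", "hallucination_tendency", 0.0),
    Result("GENE_A", "answer_completeness", 0.5),
]
print(summarize(results))
```

Reporting one number per perspective, rather than a single overall score, keeps a weakness in one dimension from being masked by strength in another.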
In practice
- Evaluate LLMs for gene tasks using multi-dimensional benchmarks.
- Prioritize hallucination resistance for sparsely annotated genes.
- Expect recall deficiencies in multi-answer gene queries.
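One way to act on these points is to break evaluation results down by how well studied each gene is. The sketch below assumes publication count as a proxy for research attention and uses arbitrary bucket thresholds; both choices, along with the function names, are hypothetical and not taken from the benchmark.

```python
from collections import defaultdict


def attention_bucket(pub_count, low=10, high=100):
    """Map a publication count to a coarse attention level (thresholds are assumed)."""
    if pub_count < low:
        return "low"
    if pub_count < high:
        return "medium"
    return "high"


def accuracy_by_attention(records):
    """Compute accuracy per attention bucket.

    records: iterable of (pub_count, is_correct) pairs, one per question.
    """
    totals = defaultdict(lambda: [0, 0])  # bucket -> [correct, total]
    for pub_count, is_correct in records:
        counts = totals[attention_bucket(pub_count)]
        counts[0] += int(is_correct)
        counts[1] += 1
    return {bucket: correct / total for bucket, (correct, total) in totals.items()}


# Illustrative data only.
records = [(3, False), (5, True), (50, True), (500, True), (800, True)]
print(accuracy_by_attention(records))  # {'low': 0.5, 'medium': 1.0, 'high': 1.0}
```

A gap between the low bucket and the others would indicate that a model's apparent competence depends on how much has been published about a gene, which matters most for sparsely annotated genes.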
Topics
- Large Language Models
- Biomedical AI
- Gene-to-Function Reasoning
- LLM Benchmarking
- Genomics
- Hallucination Detection
Best for: AI Scientist, Research Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.