SciHorizon-GENE: Benchmarking LLM for Life Sciences Inference from Gene Knowledge to Functional Understanding

· Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

SciHorizon-GENE is a large-scale benchmark that evaluates the ability of large language models (LLMs) to reason from gene-level knowledge to functional understanding in the life sciences. Built from authoritative biological databases, it integrates curated knowledge for over 190,000 human genes and comprises more than 540,000 questions. The benchmark assesses LLMs from four perspectives, each targeting a common failure mode: research attention sensitivity, hallucination tendency, answer completeness, and literature influence. A systematic evaluation of 27 general-purpose and biomedical LLMs revealed substantial heterogeneity in gene-level reasoning capability. The study found persistent difficulty in generating faithful, complete, and literature-grounded functional interpretations, and domain-specialized models did not consistently outperform general-purpose LLMs.

Key takeaway

Research scientists and machine learning engineers adopting LLMs for biomedical interpretation should recognize that current models show systematic reliability gaps in gene-level functional understanding. Strong performance on general biomedical QA does not guarantee faithful gene-level interpretation. Before deploying a model in a critical biological analysis pipeline, evaluate it along multiple dimensions, as SciHorizon-GENE does, to expose specific failure modes such as hallucination on low-attention genes or limited recall on multi-answer queries.
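
One way to act on this is to break an evaluation down by how well studied each gene is, rather than reporting a single aggregate score. The sketch below is a minimal illustration, not the benchmark's own procedure: the tier labels ("high", "low"), the per-question `hallucinated` flag, and the function name `hallucination_rate_by_tier` are all hypothetical, and the records are toy values made up for the example, not results from the paper.

```python
from collections import defaultdict


def hallucination_rate_by_tier(records):
    """Return the fraction of flagged answers per attention tier.

    `records` is an iterable of (tier, hallucinated) pairs, where `tier`
    is a hypothetical label for how much research attention a gene has
    received and `hallucinated` is a bool set by some upstream checker.
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for tier, hallucinated in records:
        totals[tier] += 1
        if hallucinated:
            flagged[tier] += 1
    return {tier: flagged[tier] / totals[tier] for tier in totals}


# Toy data for illustration only (NOT measurements from the paper).
records = [
    ("high", False), ("high", False), ("high", True), ("high", False),
    ("low", True), ("low", True), ("low", False), ("low", False),
]
rates = hallucination_rate_by_tier(records)
print(rates)
```

Reporting one rate per tier makes a gap between well-studied and under-studied genes visible, where a pooled average could hide it.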

Key insights

LLMs fail systematically at gene-to-function reasoning: they struggle with completeness, hallucination, and context integration despite their promise on general biomedical tasks.

Method

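SciHorizon-GENE is a gene-centric benchmark built from NCBI Gene, GO (Gene Ontology), and PubMed data, with more than 540,000 generated questions. It evaluates LLMs on four behavioral perspectives (research attention sensitivity, hallucination tendency, answer completeness, and literature influence) using automated metrics.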
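
To make "answer completeness" concrete, the sketch below scores a multi-answer response with set-based precision, recall, and F1. This is an assumption about what an automated completeness metric could look like, not the metric defined by the authors; the names `CompletenessScore`, `normalize`, and `score_multi_answer` are hypothetical, and matching on case-folded strings is a simplification (a real pipeline would more likely compare ontology identifiers).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompletenessScore:
    precision: float
    recall: float
    f1: float


def normalize(term: str) -> str:
    """Lowercase and collapse whitespace so trivial variants match."""
    return " ".join(term.lower().split())


def score_multi_answer(predicted: list[str], gold: list[str]) -> CompletenessScore:
    """Compare a model's listed functions against the reference set."""
    pred = {normalize(t) for t in predicted if t.strip()}
    ref = {normalize(t) for t in gold if t.strip()}
    hits = len(pred & ref)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return CompletenessScore(precision, recall, f1)


# Illustrative terms only; not taken from the benchmark.
example = score_multi_answer(
    predicted=["DNA repair", "apoptotic process", "cell migration"],
    gold=["DNA repair", "Apoptotic Process", "cell cycle arrest", "response to hypoxia"],
)
print(example)
```

Here the model recovers two of the four reference functions and adds one unsupported term, so recall is 0.5 and precision is 2/3. Keeping the two numbers separate distinguishes an incomplete answer (low recall) from an answer padded with unsupported functions (low precision).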

Best for: AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.