BAGEL: Benchmarking Animal Knowledge Expertise in Language Models

2026-04-21 · Source: cs.AI updates on arXiv.org · Field: Science & Research — Life Sciences & Biology, Environmental Science & Earth Systems, Research Methodology & Innovation · Depth: Expert, extended

Summary

BAGEL (Benchmark for evaluating Animal knowledGe Expertise in Language models) assesses how well large language models (LLMs) handle specialized animal-related knowledge in a closed-book setting. It comprises 11,852 multiple-choice questions drawn from scientific and reference sources including Wikipedia, Global Biotic Interactions (GloBI), bioRxiv, and Xeno-canto, and covers taxonomy, morphology, habitat, behavior, vocalization, geographic distribution, and species interactions. The benchmark measures what a model knows without external retrieval at inference time, and it supports fine-grained analysis by source domain, taxonomic group, and knowledge category. Initial evaluations of frontier closed-source models such as GPT-5.4 and Claude Opus 4.6, alongside open-weight models such as Gemma 3 27B IT and Llama 3.1-8B Instruct, show substantial performance variation across domains, with consistently weaker scores on Xeno-canto-derived questions.

Key takeaway

AI scientists and machine learning engineers building biodiversity-related applications should add BAGEL to their evaluation pipelines to characterize model strengths and systematic failure modes in animal knowledge. A model may perform well on general text-heavy domains yet struggle with specialized bioacoustic text, which indicates a need for targeted fine-tuning or architectural adjustments for specific knowledge types.

Key insights

BAGEL evaluates LLMs' specialized animal knowledge across diverse sources in a closed-book, multiple-choice format.

Principles

Domain-specific benchmarks reveal nuanced LLM capabilities.
Closed-book evaluation measures intrinsic model knowledge.
Performance varies significantly across knowledge domains.

Method

BAGEL uses GPT-4o-mini to construct multiple-choice questions from Wikipedia, GloBI, bioRxiv, and Xeno-canto, then shuffles the answer options to mitigate positional bias.
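
The option-shuffling step can be illustrated with a minimal Python sketch. This is not the authors' code: the MCQ record and the shuffle_options helper are hypothetical names introduced here, and the sketch only assumes that each question stores its options together with the index of the correct one.

```python
import random
from dataclasses import dataclass


@dataclass
class MCQ:
    # Hypothetical record layout; BAGEL's actual schema may differ.
    question: str
    options: list[str]
    answer_index: int  # position of the correct option in `options`


def shuffle_options(item: MCQ, rng: random.Random) -> MCQ:
    """Return a copy of `item` with its options in random order.

    The answer index is remapped so that it still points at the
    same option text after shuffling.
    """
    # order[j] is the original index of the option placed at position j.
    order = list(range(len(item.options)))
    rng.shuffle(order)
    new_options = [item.options[i] for i in order]
    # The correct option now sits wherever its original index landed.
    new_answer = order.index(item.answer_index)
    return MCQ(item.question, new_options, new_answer)
```

Passing a seeded random.Random instance keeps the shuffle reproducible across evaluation runs, which matters when scores from different models are compared on the same option order.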

In practice

Use BAGEL to diagnose LLM strengths in biodiversity tasks.
Focus on domain-level accuracy, not just overall scores.
Be aware of potential multiple-choice artifacts in evaluations.
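
The last two points can be sketched in a few lines of Python. The field names used here ("source", "predicted", "gold") are assumptions chosen for illustration, not BAGEL's published schema; each record is taken to be one model answer to one question.

```python
from collections import Counter, defaultdict


def accuracy_by_group(records: list[dict], key: str) -> dict[str, float]:
    """Compute accuracy separately for each value of `key`.

    `key` might be a hypothetical "source" field (Wikipedia, GloBI, ...)
    or a taxonomic-group or knowledge-category field.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        group = record[key]
        total[group] += 1
        correct[group] += int(record["predicted"] == record["gold"])
    return {group: correct[group] / total[group] for group in total}


def predicted_label_counts(records: list[dict]) -> Counter:
    """Count how often each answer label was predicted.

    A strongly skewed distribution, given shuffled options, can hint
    at a positional preference rather than knowledge.
    """
    return Counter(record["predicted"] for record in records)
```

Calling accuracy_by_group(records, "source") gives one score per source domain, which makes a gap such as the one reported for Xeno-canto-derived questions visible where a single overall score would hide it.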

Topics

BAGEL Benchmark
Language Model Evaluation
Animal Knowledge
Biodiversity Applications
Closed-Book Question Answering

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.