SciR: A Controllable Benchmark for Scientific Reasoning in LLMs
Summary
SciR is a controllable benchmark for evaluating Large Language Models (LLMs) on three forms of scientific reasoning: deduction, induction, and causal abduction. Existing benchmarks are either human-annotated, which makes them costly, or synthetic logical-reasoning tasks that lack scientific context. SciR generates each task from a formal object such as a deduction tree or causal graph, so the answer is verifiable, and then renders it into multi-document scientific discourse using domain-tuned genres. This construction gives independent control over two difficulty axes: information extraction and principled inference. Initial tests on six models show that raising difficulty on either axis significantly degrades performance and that the two effects compound. The rendering step affects even neurosymbolic pipelines. The benchmark yields a per-model extraction-vs-inference profile, in which reasoning models such as deepseek-r1 generally outperform non-reasoning instruct models on the inference axis.
Key takeaway
For research scientists evaluating LLMs for scientific applications, SciR offers a way to benchmark reasoning with verifiable answers. Because extraction difficulty and inference difficulty can be set independently, you can profile where a given model is strong or weak on each axis rather than relying on a single aggregate score. That profile can inform model selection or fine-tuning for the specific reasoning demands of a task. Consider using SciR to see how LLMs handle multi-document scientific discourse and inferential tasks.
Key insights
- SciR is the first multi-paradigm scientific-reasoning benchmark with parametric control over information extraction and inference difficulty.
Principles
- Scientific reasoning involves deduction, induction, and causal abduction.
- LLM evaluation requires verifiable ground truth.
- Extraction difficulty and inference difficulty compound when both are raised.
Method
Tasks are generated from formal objects (deduction tree, inductive rule hypothesis, causal graph) to guarantee verifiable answers, then rendered into multi-document scientific discourse via per-track domain-tuned genres.
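The generate-then-render idea can be illustrated with a toy sketch. This is not SciR's actual generator: the function name `make_deduction_task`, the chain-shaped "tree", the sentence templates, and the parameters `depth` (standing in for inference difficulty) and `n_distractors` (standing in for extraction difficulty) are all assumptions made for illustration.

```python
import random

def make_deduction_task(depth, n_distractors, n_docs=3, seed=0):
    """Build a toy chain-shaped deduction task and render it as short documents.

    Hypothetical illustration only; not the benchmark's real construction.
    """
    rng = random.Random(seed)

    # Formal object: the fact P0 plus the rule chain P0 -> P1 -> ... -> P_depth.
    props = [f"P{i}" for i in range(depth + 1)]
    rules = [(props[i], props[i + 1]) for i in range(depth)]
    facts = {props[0]}

    # Ground truth is computed from the formal object by forward chaining,
    # never from the rendered text.
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            if premise in derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True

    # Rendering: template sentences plus irrelevant distractors,
    # shuffled and dealt round-robin across several documents.
    sentences = [f"Observation: {f} holds in the sample." for f in sorted(facts)]
    sentences += [f"Prior work establishes that {a} implies {b}." for a, b in rules]
    sentences += [f"Unrelated measurement D{k} was also recorded."
                  for k in range(n_distractors)]
    rng.shuffle(sentences)
    docs = [sentences[i::n_docs] for i in range(n_docs)]

    target = props[-1]
    return {
        "documents": docs,
        "question": f"Does {target} hold?",
        "answer": target in derived,
    }
```

In this sketch the answer is always true because the chain is never broken; a fuller generator would also produce queries that are not derivable. The point is only that the label comes from the formal object, while the two parameters change how hard the text is to read and how long the inference is.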
In practice
- Benchmark LLMs on multi-paradigm scientific reasoning.
- Independently vary extraction and inference difficulty.
- Profile models for extraction vs. inference capabilities.
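One way to turn per-item results into an extraction-vs-inference profile is sketched below. The record fields (`extraction`, `inference`, `correct`), the function names, and the choice to measure each axis as the accuracy drop from its easiest to its hardest level are assumptions for illustration, not the metric used in the paper.

```python
from collections import defaultdict

def accuracy_grid(records):
    """Map (extraction_level, inference_level) -> accuracy from per-item results."""
    totals = defaultdict(lambda: [0, 0])  # cell -> [n_correct, n_items]
    for r in records:
        cell = (r["extraction"], r["inference"])
        totals[cell][0] += int(r["correct"])
        totals[cell][1] += 1
    return {cell: c / n for cell, (c, n) in totals.items()}

def axis_profile(grid):
    """Accuracy drop from the easiest to the hardest level on each axis,
    holding the other axis at its easiest level (hypothetical summary)."""
    e_levels = sorted({e for e, _ in grid})
    i_levels = sorted({i for _, i in grid})
    e0, i0 = e_levels[0], i_levels[0]
    base = grid[(e0, i0)]
    return {
        "extraction_drop": base - grid[(e_levels[-1], i0)],
        "inference_drop": base - grid[(e0, i_levels[-1])],
    }
```

Running `axis_profile(accuracy_grid(records))` once per model gives two numbers per model that can be compared side by side; the full grid is still needed to see whether the two difficulties compound.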
Topics
- LLM Benchmarking
- Scientific Reasoning
- Deductive Reasoning
- Inductive Reasoning
- Causal Abduction
- Information Extraction
- Model Evaluation
Best for: AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.