SciR: A Controllable Benchmark for Scientific Reasoning in LLMs

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, quick

Summary

SciR is a controllable benchmark for evaluating Large Language Models (LLMs) on three modes of scientific reasoning: deduction, induction, and causal abduction. Existing benchmarks are either human-annotated, which is costly, or synthetic logical-reasoning tasks that lack scientific context. SciR generates each task from a formal object, such as a deduction tree or a causal graph, so that the answer is verifiable, and then renders it into multi-document scientific discourse using domain-tuned genres. This construction allows independent control over two difficulty axes: information extraction and principled inference. Initial tests on six models show that both axes significantly degrade performance and that their effects compound. Notably, the rendering step also affects neurosymbolic pipelines. The benchmark yields a per-model extraction-vs-inference profile, in which reasoning models such as deepseek-r1 generally outperform non-reasoning instruct models on the inference axis.

Key takeaway

For research scientists evaluating LLMs for scientific applications, SciR offers a way to benchmark reasoning against verifiable answers. Because extraction difficulty and inference difficulty can be controlled independently, it can show where a model fails: in pulling the relevant facts out of scientific discourse, or in reasoning over them. That profile can inform model selection or fine-tuning for a specific scientific reasoning workload, which a single aggregate score cannot do.

Key insights

SciR is presented as the first multi-paradigm scientific-reasoning benchmark with parametric control over both information-extraction difficulty and inference difficulty.

Principles

Scientific reasoning involves deduction, induction, and causal abduction.
LLM evaluation requires verifiable ground truth.
Extraction difficulty and inference difficulty compound when both are raised.

Method

Tasks are generated from formal objects (deduction tree, inductive rule hypothesis, causal graph) to guarantee verifiable answers, then rendered into multi-document scientific discourse via per-track domain-tuned genres.
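The generate-then-render idea can be illustrated with a toy deduction track. The sketch below is not SciR's implementation: the function names, the sentence templates, and the three parameters (`depth` for inference difficulty, `n_distractors` and `n_documents` for extraction difficulty) are all assumptions chosen for illustration. It shows only the core property described above: the answer is computed from the formal object, so it stays verifiable however the text is rendered.

```python
import random
from dataclasses import dataclass


@dataclass
class Task:
    documents: list  # rendered sentences, grouped into documents
    question: str
    answer: bool     # ground truth computed from the formal object
    depth: int


def forward_chain(facts, rules):
    """Return every proposition derivable from facts via (premise, conclusion) rules."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            if premise in known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


def make_task(depth, n_distractors, n_documents, seed=0):
    """Build a toy deduction task (hypothetical parameters, not SciR's)."""
    rng = random.Random(seed)
    # Formal object: a chain P0 -> P1 -> ... -> P_depth (inference axis).
    chain = [(f"P{i}", f"P{i + 1}") for i in range(depth)]
    # Irrelevant rules that the reader must filter out (extraction axis).
    noise = [(f"Q{i}", f"Q{i + 1}") for i in range(n_distractors)]
    facts = ["P0"]
    rules = chain + noise
    target = f"P{depth}"
    # The answer comes from the formal object, never from the rendered text.
    answer = target in forward_chain(facts, rules)
    # Rendering: placeholder templates standing in for domain-tuned genres.
    sentences = [f"Measurements confirm that {f} holds." for f in facts]
    sentences += [f"Whenever {a} holds, {b} follows." for a, b in rules]
    rng.shuffle(sentences)
    # Scatter the sentences across several documents.
    documents = [sentences[i::n_documents] for i in range(n_documents)]
    return Task(documents, f"Does {target} hold?", answer, depth)


task = make_task(depth=3, n_distractors=2, n_documents=2, seed=1)
```

Raising `depth` lengthens the chain a model must follow, while raising `n_distractors` or `n_documents` makes the relevant statements harder to locate without changing the underlying proof.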

In practice

Benchmark LLMs on multi-paradigm scientific reasoning.
Independently vary extraction and inference difficulty.
Profile models for extraction vs. inference capabilities.
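One way to turn such runs into an extraction-vs-inference profile is to score a model on a grid of difficulty levels and then average along each axis. The sketch below assumes a hypothetical record format of `(extraction_level, inference_level, correct)`. The `toy_records` values are made-up placeholders and not results from the paper.

```python
from collections import defaultdict


def accuracy_grid(records):
    """Map (extraction_level, inference_level) -> accuracy in that cell."""
    totals = defaultdict(lambda: [0, 0])  # cell -> [hits, attempts]
    for ext, inf, correct in records:
        cell = totals[(ext, inf)]
        cell[0] += int(correct)
        cell[1] += 1
    return {key: hits / n for key, (hits, n) in totals.items()}


def marginal(grid, axis):
    """Unweighted mean accuracy per level along one axis (0 = extraction, 1 = inference)."""
    by_level = defaultdict(list)
    for key, acc in grid.items():
        by_level[key[axis]].append(acc)
    return {level: sum(v) / len(v) for level, v in sorted(by_level.items())}


# Placeholder outcomes for illustration only.
toy_records = [
    (0, 0, True), (0, 0, True),
    (0, 1, True), (0, 1, False),
    (1, 0, True), (1, 0, False),
    (1, 1, False), (1, 1, False),
]
grid = accuracy_grid(toy_records)
extraction_profile = marginal(grid, axis=0)
inference_profile = marginal(grid, axis=1)
```

Comparing the hardest cell of the grid with the two single-axis cells is a simple check for compounding: if the joint drop exceeds what either axis produces alone, the difficulties interact rather than merely coexist.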

Topics

LLM Benchmarking
Scientific Reasoning
Deductive Reasoning
Inductive Reasoning
Causal Abduction
Information Extraction
Model Evaluation

Best for: AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.