The State of Peer Review in Empirical Software Engineering: A Community Survey on Review Load, Quality, and GenAI Use

Source: cs.SE updates on arXiv.org · Field: Science & Research (Research Methodology & Innovation, Software Development & Engineering, Artificial Intelligence & Machine Learning) · Depth: Advanced, extended

Summary

A recent questionnaire survey of 120 members of the Empirical Software Engineering (ESE) community, predominantly seasoned academics from Europe and North America, reveals significant challenges in the peer review system. Two-thirds of respondents perceive their review load as high or very high, with an average of 25-32 reviews per year, and conferences account for two-thirds of this effort. Participants generally rate the quality of their own reviews highly, but they frequently cite high workload (110 respondents), mismatched expertise (64), and insufficient recognition (51) as obstacles to quality. The most common problems in the reviews they receive are shallowness (82), generic feedback (59), and unrealistic demands (48). Over half of respondents (70) do not use LLMs for reviewing; among those who do, public services such as ChatGPT (36) are common, often to improve the presentation or politeness of a review. The community is divided on whether to explore LLM integration, but a strong majority (81) supports banning unethical LLM use by authors and reviewers. Suggestions for improvement focus on reducing review load, strengthening governance, and integrating LLMs responsibly.
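The respondent counts above are easier to compare as shares of the sample. The following is a minimal Python sketch of that arithmetic, not part of the original study. It assumes that every count is out of all 120 respondents, which the summary does not confirm (some questions may have been answered by fewer people). The ChatGPT figure (36) is left out because it may refer only to respondents who use LLMs. The labels and the `share` helper are illustrative names.

```python
# Counts as reported in the summary above.
# ASSUMPTION: each count is out of all 120 respondents; the summary
# does not state the per-question denominator.
TOTAL_RESPONDENTS = 120

reported_counts = {
    "obstacle: high workload": 110,
    "obstacle: mismatched expertise": 64,
    "obstacle: insufficient recognition": 51,
    "received reviews: shallow": 82,
    "received reviews: generic feedback": 59,
    "received reviews: unrealistic demands": 48,
    "does not use LLMs for reviewing": 70,
    "supports banning unethical LLM use": 81,
}


def share(count, total=TOTAL_RESPONDENTS):
    """Return count as a percentage of total, rounded to one decimal."""
    return round(100 * count / total, 1)


# Percentage of the (assumed) full sample for each reported count.
shares = {label: share(count) for label, count in reported_counts.items()}

for label, pct in shares.items():
    print(f"{label}: {reported_counts[label]}/{TOTAL_RESPONDENTS} = {pct}%")
```

Under this assumption, the 81 respondents who support a ban correspond to 67.5% of the sample, and the 70 who do not use LLMs for reviewing correspond to roughly 58%, consistent with the "strong majority" and "over half" wording in the summary.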

Key takeaway

Research Scientists and Directors of AI/ML who manage publication strategies should recognize that the ESE peer review system is strained by high workloads and inconsistent quality. Prioritize ethical GenAI use by providing clear guidelines and enforcing strict penalties for misuse. Advocate for better reviewer incentives, and consider early desk rejections to reduce reviewer burden and improve overall review quality within your community.

Key insights

The ESE peer review system faces an unsustainable workload, quality problems, and ethical dilemmas exacerbated by GenAI, all of which call for urgent systemic change.

Method

The authors conducted a questionnaire survey of 120 ESE reviewers using 22 questions (mostly multiple-choice) covering review load, review quality, LLM use, and suggestions for improvement. The questionnaire was pilot-tested, and responses were collected anonymously.

Best for: Research Scientist, AI Scientist, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.