Are LLMs Bad at Moral Reasoning?

2026-06-10 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, AI Ethics & Responsible AI · Depth: Expert, quick

Summary

A new analysis challenges earlier pessimistic conclusions about the moral reasoning capabilities of large language models (LLMs), particularly those drawn from the MoReBench dataset. Earlier research benchmarked frontier AI models against 1,000 gold-standard, human-authored rubrics for moral reasoning across a range of cases, with underwhelming results. This paper argues that when LLMs are asked to generate scoring rubrics for moral analysis, rather than open-ended responses, they appear significantly more capable. The LLM-generated rubrics align more closely with the human-authored rubrics than the earlier results would suggest. Where discrepancies exist, the paper attributes them to the vast dimensionality of moral problems or to human authors departing from the rubric creation guidelines, suggesting that LLMs have greater moral reasoning capacity than previously believed.

Key takeaway

AI scientists and ethicists who design or interpret moral reasoning benchmarks should critically re-evaluate the task they give to large language models. If your current evaluations score open-ended LLM responses, consider shifting to a rubric generation task. This approach may reveal significantly higher moral competence in LLMs, suggesting that current pessimistic conclusions might stem from methodological choices rather than inherent AI limitations. Adjusting your evaluation framework could lead to more accurate assessments.

Key insights

Re-tasking LLMs to generate moral reasoning rubrics reveals significantly higher moral competence than prior evaluations suggested.

Principles

Moral competence involves identifying and responding to moral reasons.
Evaluation task design critically influences perceived AI capabilities.
Moral problems often possess vast dimensionality.

Method

LLMs are asked to generate scoring rubrics for the moral analysis of cases, and the generated rubrics are then compared against the human-authored gold standards.
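
As a rough illustration of the comparison step, the sketch below measures how many gold criteria an LLM-generated rubric covers, and how many generated criteria have a gold counterpart. This summary does not say how the paper performs the comparison, so everything here is an assumption: the function names (`tokens`, `jaccard`, `rubric_coverage`), the 0.5 threshold, and the use of plain word overlap as a stand-in for whatever matching procedure the authors used. The example criteria are invented for illustration and are not taken from MoReBench.

```python
import re


def tokens(text):
    """Return the set of lowercased words in a rubric criterion."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Word-overlap similarity between two criteria, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def rubric_coverage(generated, gold, threshold=0.5):
    """Return (recall, precision) of generated criteria against gold criteria.

    recall:    share of gold criteria matched by some generated criterion.
    precision: share of generated criteria matched by some gold criterion.
    A pair counts as a match when its Jaccard similarity meets the
    (hypothetical) threshold.
    """
    gold_hits = [any(jaccard(g, c) >= threshold for c in generated) for g in gold]
    gen_hits = [any(jaccard(c, g) >= threshold for g in gold) for c in generated]
    recall = sum(gold_hits) / len(gold) if gold else 0.0
    precision = sum(gen_hits) / len(generated) if generated else 0.0
    return recall, precision


# Invented example criteria, for illustration only.
gold = [
    "Identifies the harm to the patient",
    "Considers the duty of confidentiality",
]
generated = [
    "Identifies the harm to the patient",
    "Weighs the legal consequences for the doctor",
]

recall, precision = rubric_coverage(generated, gold)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

On this reading, an unmatched generated criterion is not automatically an error. It may be a legitimate moral consideration the human rubric did not list, which is one way the dimensionality explanation could show up in practice.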

In practice

Re-evaluate AI systems by altering the task design.
Analyze discrepancies between LLM-generated and human-authored rubrics as possible dimensionality issues or departures from rubric creation guidelines.

Topics

Large Language Models
Moral Reasoning
AI Ethics
MoReBench Dataset
Benchmark Design
AI Evaluation

Best for: Research Scientist, AI Scientist, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.