Humans vs. Gary Marcus vs. Slate Star Codex: When is an AI failure actually a failure?

2026-02-19 · Source: Surge AI Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Intermediate, medium

Summary

A study challenges Gary Marcus's claim that large language models (LLMs) like GPT-3 lack commonsense reasoning and merely parrot their training data. The analysis, inspired by a debate between Marcus and Scott Alexander (Slate Star Codex), re-evaluates five specific "mistakes" Marcus attributed to GPT-3. The authors gave the same prompts to 15 human "Surgers" and compared their responses with GPT-3's. In several instances, the human responses were similar to GPT-3's or even more "unconventional", suggesting that what Marcus identifies as a failure of logical reasoning might instead be a creative or narrative-driven completion of the kind humans also produce. For example, in a scenario where a lawyer's suit pants are stained, 6 out of 15 humans suggested wearing a bathing suit to court, similar to GPT-3's completion.

Key takeaway

If you are a research scientist evaluating LLM performance, critically examine what counts as a "failure" by comparing model output against diverse human responses. Your evaluation metrics might be too narrow if they prioritize strict logical reasoning to the exclusion of creative or contextually nuanced responses, because human behavior often deviates from purely logical paths. Consider designing evaluation tasks that account for the imaginative and narrative capabilities of LLMs, rather than only penalizing deviations from a single "correct" answer.

Key insights

Human evaluation reveals GPT-3's "mistakes" often mirror human creativity or narrative, challenging strict logical reasoning benchmarks.

Principles

LLM evaluation requires nuanced human context.
"Commonsense" is not always strictly logical.
Creativity can be mistaken for error in LLMs.

Method

Each of the five GPT-3 "mistakes" identified by Gary Marcus was given as a prompt to 15 human "Surgers". Their completions were then compared with GPT-3's to assess whether the perceived errors were actually failures.
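One way to make this comparison concrete is to label each human completion and compute the share that matches the model's completion. The sketch below is illustrative only, not the procedure used in the original article: the function name `agreement_rate` and the labels `"bathing_suit"` and `"other"` are assumptions. Only the 6-of-15 split comes from the summary above; what the remaining nine people wrote is not stated, so `"other"` is a placeholder.

```python
from collections import Counter


def agreement_rate(human_labels, model_label):
    """Return the fraction of human completions labelled the same as the model's."""
    if not human_labels:
        raise ValueError("need at least one human completion")
    counts = Counter(human_labels)
    # Counter returns 0 for labels that never occur, so no match gives 0.0.
    return counts[model_label] / len(human_labels)


# Hypothetical labels for the stained-suit prompt: 6 of 15 humans chose the
# bathing suit (from the summary); "other" is a placeholder for the rest.
human_labels = ["bathing_suit"] * 6 + ["other"] * 9
rate = agreement_rate(human_labels, "bathing_suit")
print(f"{rate:.0%} of human completions matched the model's choice")
```

A high agreement rate would suggest the completion reflects a common human reading of the prompt rather than a model-specific reasoning failure; where to set that threshold is a judgment call for the evaluator.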

In practice

Use human evaluators for LLM "failure" analysis.
Consider narrative intent in LLM responses.
Re-evaluate "commonsense" benchmarks for LLMs.

Topics

Large Language Models
AI Evaluation
Commonsense Reasoning
AI Scaling Hypothesis
Human-AI Comparison

Best for: Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Surge AI Blog.