Humans vs. Gary Marcus vs. Slate Star Codex: When is an AI failure actually a failure?
Summary
A study challenges Gary Marcus's critiques of large language models (LLMs) like GPT-3, which claim these models lack commonsense reasoning and merely parrot training data. The analysis, inspired by a debate between Marcus and Scott Alexander (Slate Star Codex), re-evaluates five specific "mistakes" Marcus attributed to GPT-3. Researchers presented these same prompts to 15 human "Surgers" and compared their responses to GPT-3's. The findings indicate that in several instances, human responses were similar to or even more "unconventional" than GPT-3's, suggesting that what Marcus identifies as a failure of logical reasoning might instead reflect creative or narrative-driven completions, akin to human thought processes. For example, in a scenario where a lawyer's suit pants are stained, 6 out of 15 humans suggested wearing a bathing suit to court, similar to GPT-3's completion.
Key takeaway
If you are a research scientist evaluating LLM performance, critically examine what counts as a "failure" by comparing model output against a range of human responses. Evaluation metrics that reward only strict logical reasoning may be too narrow, because human responses to the same prompts often depart from the purely logical path as well. Consider designing evaluation tasks that account for the imaginative and narrative capabilities of LLMs, rather than solely penalizing deviations from a single "correct" answer.
Key insights
Human evaluation suggests that GPT-3's "mistakes" often mirror human creativity or storytelling, which calls strict logical-reasoning benchmarks into question.
Principles
- LLM evaluation requires nuanced human context.
- "Commonsense" is not always strictly logical.
- Creativity can be mistaken for error in LLMs.
Method
Five prompts on which Gary Marcus identified GPT-3 "mistakes" were each given to 15 human "Surgers". Their completions were then compared against GPT-3's to assess the nature of the perceived errors.
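The comparison described above can be sketched as a simple agreement rate: the share of human completions that fall into the same category as the model's completion. This is a minimal illustration, not the study's actual procedure. The function name `human_agreement`, the category labels, and the idea of hand-labeling each completion are all assumptions. Only the 6-out-of-15 count for the stained-suit prompt comes from the summary; the remaining nine responses are lumped under a placeholder label because their content is not given here.

```python
from collections import Counter


def human_agreement(human_labels, model_label):
    """Return the fraction of human completions that share the model's label."""
    if not human_labels:
        raise ValueError("need at least one human completion")
    counts = Counter(human_labels)
    # Counter returns 0 for labels that never appear, so no KeyError here.
    return counts[model_label] / len(human_labels)


# Hypothetical hand-assigned labels for the stained-suit prompt.
# 6 of 15 humans suggested the bathing suit (from the summary);
# the other 9 are grouped under a placeholder "other" label.
human_labels = ["bathing_suit"] * 6 + ["other"] * 9
model_label = "bathing_suit"

rate = human_agreement(human_labels, model_label)
print(f"{rate:.0%} of humans matched the model's completion")
```

A high agreement rate on a prompt would suggest the model's completion is within the range of ordinary human responses, which is the kind of evidence the analysis uses to question whether the completion is a reasoning failure.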
In practice
- Use human evaluators for LLM "failure" analysis.
- Consider narrative intent in LLM responses.
- Re-evaluate "commonsense" benchmarks for LLMs.
Topics
- Large Language Models
- AI Evaluation
- Commonsense Reasoning
- AI Scaling Hypothesis
- Human-AI Comparison
Best for: Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Surge AI Blog.