The Riddle Riddle: Testing Flexible Reasoning in Large Language Models and Humans
Summary
A novel "riddle riddle" paradigm was introduced to assess flexible reasoning in large language models (LLMs) and humans, distinguishing it from pattern matching. Riddle riddles are word problems structured like popular riddles but require only literal interpretations for correct answers. The study involved two experiments with nine state-of-the-art LLMs and 100 human participants. Results showed LLMs were significantly more accurate on genuine riddles (84.9%) than on riddle riddles (50.7%), indicating a tendency towards inventive reasoning even when literal interpretation suffices. Conversely, humans performed better on riddle riddles (80.5%) than genuine riddles (50.5%). Error analysis revealed 90.8% of LLM errors on riddle riddles stemmed from inappropriate inventive reasoning, while 57.6% of human errors on genuine riddles were due to overextending literal reasoning. This suggests LLMs' strong performance on genuine riddles might reflect memory retrieval rather than flexible strategy selection.
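The reversal between the two groups is easiest to see as a gap in percentage points. The short sketch below only restates the accuracies quoted in the summary above; the dictionary layout and the `gap` helper are illustrative names, not anything taken from the original study.

```python
# Accuracies (percent) exactly as quoted in the summary above.
reported = {
    "LLMs": {"genuine": 84.9, "riddle_riddle": 50.7},
    "humans": {"genuine": 50.5, "riddle_riddle": 80.5},
}


def gap(scores):
    """Genuine-riddle accuracy minus riddle-riddle accuracy, in percentage points."""
    return round(scores["genuine"] - scores["riddle_riddle"], 1)


# Positive gap: better on genuine riddles. Negative gap: better on riddle riddles.
gaps = {group: gap(scores) for group, scores in reported.items()}
print(gaps)  # {'LLMs': 34.2, 'humans': -30.0}
```

The opposite signs are the point: the LLMs lose about 34 points when a riddle-shaped problem only needs a literal answer, while humans gain about 30.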
Key takeaway
If you are an AI scientist evaluating LLM capabilities, critically assess whether observed "reasoning" reflects genuine flexible strategy selection or mere pattern matching. Your evaluation paradigms should include stimuli such as "riddle riddles" that force models to adapt their reasoning to a problem's content, not just its form. This helps you avoid conflating memory retrieval with true reasoning and supports more robust and reliable model development.
Key insights
LLMs struggle with flexible reasoning: their pattern matching can be mistaken for genuine understanding, whereas humans adapt their reasoning strategy to the problem.
Principles
- LLMs often default to inventive reasoning.
- Surface features can mislead LLMs' reasoning.
- Genuine reasoning requires flexible strategy selection.
Method
The "riddle riddle" paradigm tests flexible reasoning by presenting riddle-like problems that require only literal answers, then comparing performance on them with performance on genuine riddles.
In practice
- Design stimuli to contrast reasoning types.
- Evaluate LLMs beyond surface-level accuracy.
- Distinguish memory retrieval from flexible reasoning.
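The points above can be sketched as a small evaluation harness that scores accuracy separately for each stimulus type. This is a minimal sketch under stated assumptions, not the study's actual procedure: the `Item` class, the two toy items, the `pattern_matcher` responder, and exact-match scoring are all hypothetical and invented here for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    prompt: str
    answer: str
    condition: str  # "genuine" or "riddle_riddle"


# Toy stimuli invented for this sketch; they are NOT items from the study.
ITEMS = [
    Item("What has keys but can't open locks?", "piano", "genuine"),
    Item("What has keys and can open locks?", "keyring", "riddle_riddle"),
]


def accuracy_by_condition(items, respond):
    """Score a responder separately per condition (exact match is an assumption)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item.condition] += 1
        if respond(item.prompt).strip().lower() == item.answer.lower():
            correct[item.condition] += 1
    return {condition: correct[condition] / total[condition] for condition in total}


def pattern_matcher(prompt):
    """Hypothetical responder that keys on surface form and ignores content."""
    return "piano" if "keys" in prompt else "unknown"


print(accuracy_by_condition(ITEMS, pattern_matcher))
# {'genuine': 1.0, 'riddle_riddle': 0.0}
```

Reporting the two conditions separately, rather than pooling them into one accuracy figure, is what exposes a responder that retrieves a familiar answer from the riddle's form instead of reading its content.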
Topics
- Large Language Models
- Flexible Reasoning
- Cognitive Tasks
- Riddle Riddle Paradigm
- Human-AI Comparison
- Error Analysis
Best for: AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.