The Riddle Riddle: Testing Flexible Reasoning in Large Language Models and Humans

· Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, quick

Summary

The study introduces a "riddle riddle" paradigm for assessing flexible reasoning in large language models (LLMs) and humans and for separating it from pattern matching. Riddle riddles are word problems that are structured like popular riddles but whose correct answers require only a literal interpretation. Two experiments tested nine state-of-the-art LLMs and 100 human participants. LLMs were significantly more accurate on genuine riddles (84.9%) than on riddle riddles (50.7%), indicating a tendency toward inventive reasoning even when a literal interpretation suffices. Humans showed the reverse pattern, performing better on riddle riddles (80.5%) than on genuine riddles (50.5%). Error analysis found that 90.8% of LLM errors on riddle riddles stemmed from inappropriate inventive reasoning, while 57.6% of human errors on genuine riddles came from overextending literal reasoning. This suggests that LLMs' strong performance on genuine riddles may reflect memory retrieval rather than flexible strategy selection.

Key takeaway

AI scientists evaluating LLM capabilities should critically assess whether observed "reasoning" reflects genuine flexible strategy selection or mere pattern matching. Evaluation paradigms should include stimuli such as "riddle riddles" that force models to adapt their reasoning to a problem's content rather than its form. This helps avoid conflating memory retrieval with true reasoning and supports more robust, reliable model development.

Key insights

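LLMs struggle with flexible reasoning: their answers often track a problem's riddle-like form rather than its content, which makes pattern matching easy to mistake for genuine understanding. Humans showed the opposite profile, doing better on riddle riddles than on genuine riddles.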

Method

The "riddle riddle" paradigm tests flexible reasoning by presenting riddle-like problems that require literal answers and comparing performance on them with performance on genuine riddles.
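
The comparison at the heart of this method can be sketched in a few lines of Python. This is a minimal illustration and not the authors' analysis code: the condition labels `genuine` and `riddle_riddle`, the function names, and the toy records are all assumptions made for the example, and the toy numbers are not the study's data.

```python
from collections import defaultdict


def accuracy_by_condition(records):
    """Return per-condition accuracy from (condition, is_correct) pairs.

    The condition labels are hypothetical; any hashable label works.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for condition, is_correct in records:
        totals[condition] += 1
        correct[condition] += int(is_correct)
    return {c: correct[c] / totals[c] for c in totals}


def form_gap(acc):
    """Accuracy on genuine riddles minus accuracy on riddle riddles.

    A large positive gap is the pattern the summary reports for LLMs;
    a negative gap is the pattern it reports for humans.
    """
    return acc["genuine"] - acc["riddle_riddle"]


# Toy, made-up responses (not results from the study).
toy_records = [
    ("genuine", True), ("genuine", True), ("genuine", True), ("genuine", False),
    ("riddle_riddle", True), ("riddle_riddle", False),
    ("riddle_riddle", False), ("riddle_riddle", False),
]

acc = accuracy_by_condition(toy_records)
gap = form_gap(acc)
print(acc, gap)
```

Reporting the gap alongside raw accuracy is one way to keep strong performance on familiar riddle forms from being read as flexible reasoning on its own.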

Best for: AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.