The Riddle Riddle: Testing Flexible Reasoning in Large Language Models and Humans

2026-06-25 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, quick

Summary

A novel "riddle riddle" paradigm is introduced to assess flexible reasoning in large language models (LLMs) and humans and to separate it from pattern matching. Riddle riddles are word problems structured like popular riddles that nonetheless require only a literal interpretation to answer correctly. The study comprised two experiments with nine state-of-the-art LLMs and 100 human participants. LLMs were significantly more accurate on genuine riddles (84.9%) than on riddle riddles (50.7%), indicating a tendency to reason inventively even when a literal reading suffices. Humans showed the opposite pattern, performing better on riddle riddles (80.5%) than on genuine riddles (50.5%). Error analysis attributed 90.8% of LLM errors on riddle riddles to inappropriate inventive reasoning and 57.6% of human errors on genuine riddles to overextended literal reasoning. These results suggest that LLMs' strong performance on genuine riddles may reflect memory retrieval rather than flexible strategy selection.

Key takeaway

AI scientists evaluating LLM capabilities should critically assess whether observed "reasoning" reflects genuine flexible strategy selection or mere pattern matching. Evaluation paradigms should include stimuli such as "riddle riddles" that force models to adapt their reasoning to a problem's content, not just its form. This helps avoid conflating memory retrieval with true reasoning and supports more robust and reliable model development.

Key insights

LLMs struggle with flexible reasoning: they tend to answer according to a problem's riddle-like form, so pattern matching can pass for genuine understanding, whereas humans more readily apply a literal reading when the content calls for one.

Principles

LLMs often default to inventive reasoning.
Surface features can mislead LLMs' reasoning.
Genuine reasoning requires flexible strategy selection.

Method

The "riddle riddle" paradigm tests flexible reasoning by presenting riddle-like problems that require literal answers and comparing performance on them with performance on genuine riddles.
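
As a minimal sketch of how this comparison could be scored, assume each response is recorded as a (condition, correct) pair. The `Trial` type, the condition labels, and both helper functions are hypothetical names chosen for illustration, not taken from the study, and the toy trials are placeholders rather than study data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    # Hypothetical condition labels: "genuine_riddle" or "riddle_riddle".
    condition: str
    correct: bool


def accuracy_by_condition(trials):
    """Return {condition: fraction of correct trials}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for t in trials:
        totals[t.condition] += 1
        hits[t.condition] += int(t.correct)
    return {c: hits[c] / totals[c] for c in totals}


def form_content_gap(acc):
    """Genuine-riddle accuracy minus riddle-riddle accuracy.

    A large positive gap is consistent with answers tracking the
    riddle-like form of a problem rather than its content.
    """
    return acc["genuine_riddle"] - acc["riddle_riddle"]


# Placeholder trials for illustration only, not data from the study.
toy_trials = [
    Trial("genuine_riddle", True),
    Trial("genuine_riddle", True),
    Trial("genuine_riddle", True),
    Trial("genuine_riddle", False),
    Trial("riddle_riddle", True),
    Trial("riddle_riddle", False),
    Trial("riddle_riddle", False),
    Trial("riddle_riddle", False),
]

acc = accuracy_by_condition(toy_trials)
gap = form_content_gap(acc)
```

Reporting the gap alongside the two accuracies keeps the contrast between conditions visible, which a single pooled accuracy figure would hide.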

In practice

Design stimuli to contrast reasoning types.
Evaluate LLMs beyond surface-level accuracy.
Distinguish memory retrieval from flexible reasoning.
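
Looking beyond accuracy can include tallying why responses were wrong, in the spirit of the error analysis in the summary. The sketch below assumes each incorrect response has already been hand-labeled with an error type. The `error_shares` helper and the label strings are hypothetical, and the toy labels are placeholders rather than study data.

```python
from collections import Counter


def error_shares(error_labels):
    """Return {label: share of all errors}, given one label per incorrect response."""
    counts = Counter(error_labels)
    total = sum(counts.values())
    if total == 0:
        return {}  # no errors to break down
    return {label: n / total for label, n in counts.items()}


# Placeholder labels for illustration only, not data from the study.
toy_errors = ["inventive", "inventive", "inventive", "other"]
shares = error_shares(toy_errors)
```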

Topics

Large Language Models
Flexible Reasoning
Cognitive Tasks
Riddle Riddle Paradigm
Human-AI Comparison
Error Analysis

Best for: AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.