What is the scientific value of administering the standard Rorschach test to LLMs when the training data is almost certainly contaminated? (R) + [D]

2026-04-28 · Source: Machine Learning · Field: Science & Research — Research Methodology & Innovation, Social Sciences & Behavioral Studies · Depth: Advanced, short

Summary

A recent paper by Csigó & Cserey (2026) in *JMIR Mental Health* administered the 10 standard Rorschach inkblot cards to three multimodal LLMs: GPT-4o, Grok 3, and Gemini 2.0. The researchers coded the models' responses using the Exner Comprehensive System. The post questions the study's methodological validity on the grounds of data contamination. The standard Rorschach cards, the extensive psychological literature about them, and typical human responses are all widely available online, so this material was very likely in the models' training data. Critics argue that the study therefore tests the models' ability to retrieve statistically probable lexical associations rather than their perception of visual ambiguity. The study also lacked robust controls: it used public web interfaces with default settings and apparently ran the test only once per model, which yields a tiny sample. The authors themselves acknowledged these limitations, noting that the models likely encountered the stimuli and scoring concepts during training.

Key takeaway

AI scientists evaluating the perceptual capabilities of multimodal LLMs should prioritize novel, uncontaminated stimuli. Relying on widely available, century-old psychological tests like the Rorschach risks demonstrating only pattern matching and text completion over pre-existing training data, not genuine understanding of visual ambiguity. Build stringent controls and novel inputs into the experimental design so that retrieval is not mistaken for perception.

Key insights

Testing LLMs with widely available psychometric stimuli primarily assesses data retrieval, not genuine perception.

Principles

Data contamination undermines the validity of LLM perception studies.
Standardized tests may reveal retrieval, not understanding.

Method

To assess LLM perception of ambiguity, use novel, AI-generated, or strictly controlled ambiguous images not present in training data, alongside robust experimental controls.
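One way to obtain stimuli that cannot have appeared in any training set is to generate them procedurally at evaluation time. The sketch below is an illustration only, not the method used in the paper or proposed in the original post: the function names `make_inkblot` and `to_pgm`, the grid size, the fill ratio, and the smoothing rule are all assumptions chosen for brevity. It builds a bilaterally symmetric blot from seeded noise and serializes it as a plain-text PGM image.

```python
import random


def make_inkblot(seed, size=32, fill=0.45, smooth_passes=3):
    """Return a size x size grid of 0/1 cells, mirrored left to right.

    Hypothetical helper: `size` is assumed to be even.
    """
    rng = random.Random(seed)
    half = size // 2
    # Start from seeded random noise on the left half only.
    left = [[1 if rng.random() < fill else 0 for _ in range(half)]
            for _ in range(size)]
    # Majority-vote smoothing so isolated pixels clump into blob-like shapes.
    for _ in range(smooth_passes):
        smoothed = [[0] * half for _ in range(size)]
        for r in range(size):
            for c in range(half):
                total = count = 0
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        rr, cc = r + dr, c + dc
                        if 0 <= rr < size and 0 <= cc < half:
                            total += left[rr][cc]
                            count += 1
                smoothed[r][c] = 1 if total * 2 > count else 0
        left = smoothed
    # Mirror the left half to get the bilateral symmetry typical of inkblots.
    return [row + row[::-1] for row in left]


def to_pgm(grid):
    """Serialize the grid as a plain (P2) PGM image: ink is 0, paper is 1."""
    height, width = len(grid), len(grid[0])
    rows = [" ".join("0" if cell else "1" for cell in row) for row in grid]
    return f"P2\n{width} {height}\n1\n" + "\n".join(rows) + "\n"
```

Recording the seed alongside each model response keeps the stimulus reproducible, and drawing fresh seeds for every run gives many trials per model instead of one.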

In practice

Verify training data overlap for evaluation stimuli.
Use novel, unseen data for true LLM capability assessment.
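A rough first check for overlap is to measure how much of a model's response repeats word sequences from published material about the stimulus. The sketch below uses word n-gram overlap as the measure; this is a common contamination heuristic, not a procedure taken from the paper, and the names `ngrams` and `overlap_ratio` and the default n-gram length of 3 are assumptions made for illustration.

```python
def ngrams(text, n=3):
    """Return the set of lowercase word n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(response, reference, n=3):
    """Fraction of the response's n-grams that also occur in the reference.

    Returns 0.0 when the response is too short to contain any n-gram.
    """
    response_grams = ngrams(response, n)
    if not response_grams:
        return 0.0
    shared = response_grams & ngrams(reference, n)
    return len(shared) / len(response_grams)
```

A high ratio against published descriptions suggests retrieval, while a low ratio does not by itself rule out contamination, since a model can paraphrase memorized content.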

Topics

Rorschach Test
LLM Evaluation
Data Contamination
Methodological Validity
Multimodal LLMs

Best for: AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.