What is the scientific value of administering the standard Rorschach test to LLMs when the training data is almost certainly contaminated? (R) + [D]
Summary
A recent paper by Csigó & Cserey (2026) in *JMIR Mental Health* administered the 10 standard Rorschach inkblot cards to three multimodal LLMs (GPT-4o, Grok 3, and Gemini 2.0) and coded the models' responses using the Exner Comprehensive System. The study's methodological validity has been questioned because of data contamination. The standard cards, the extensive psychological literature on them, and typical human responses are all widely available online, so this material was very likely part of the models' training data. Critics argue that the study therefore likely tests the models' ability to retrieve statistically probable lexical associations rather than their perception of visual ambiguity. The study also lacked robust controls: it used public web interfaces with default settings and apparently ran the test only once per model, which yields a tiny sample. The authors themselves acknowledged these limitations, noting that the models likely encountered the stimuli and scoring concepts during training.
Key takeaway
AI scientists evaluating the perceptual capabilities of multimodal LLMs should prioritize novel, uncontaminated stimuli. Relying on widely available, century-old psychological tests like the Rorschach risks demonstrating only advanced pattern matching and text completion over pre-existing training data, not genuine understanding of visual ambiguity. Experimental designs should include stringent controls and novel inputs so that retrieval is not mistaken for perception.
Key insights
Testing LLMs with widely available psychometric stimuli primarily assesses data retrieval, not genuine perception.
Principles
- Data contamination invalidates LLM perception studies.
- Standardized tests may reveal retrieval, not understanding.
Method
To assess LLM perception of ambiguity, use novel, AI-generated, or strictly controlled ambiguous images not present in training data, alongside robust experimental controls.
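One way to obtain stimuli that cannot have appeared in any training set is to generate them procedurally at evaluation time. The following is a minimal sketch, not a method from the paper or the original discussion: it builds a bilaterally symmetric black-and-white blob from a random seed and serializes it as an ASCII PGM image. The function names (`make_inkblot`, `to_pgm`) and all parameters (`fill`, `smooth_passes`, grid size) are illustrative assumptions, and whether such blobs are ambiguous in a psychologically meaningful sense would need to be validated separately.

```python
import random


def _smooth(grid):
    """One majority-vote pass: a cell becomes ink if at least half of its
    3x3 neighbourhood (clipped at the borders) is ink."""
    h, w = len(grid), len(grid[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = count = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        total += grid[ny][nx]
                        count += 1
            out[y][x] = 1 if 2 * total >= count else 0
    return out


def make_inkblot(seed, height=32, half_width=16, fill=0.5, smooth_passes=3):
    """Return a height x (2 * half_width) grid of 0/1 values (1 = ink).

    Hypothetical generator: random noise on one half, smoothed into blobs,
    then mirrored around the vertical axis.
    """
    rng = random.Random(seed)  # seeded so every stimulus is reproducible
    half = [[1 if rng.random() < fill else 0 for _ in range(half_width)]
            for _ in range(height)]
    for _ in range(smooth_passes):
        half = _smooth(half)
    # Mirror the half-image to get left-right symmetry.
    return [row[::-1] + row for row in half]


def to_pgm(grid):
    """Serialize the grid as an ASCII PGM (P2) string: ink = 0, paper = 1."""
    h, w = len(grid), len(grid[0])
    rows = [" ".join(str(1 - v) for v in row) for row in grid]
    return f"P2\n{w} {h}\n1\n" + "\n".join(rows) + "\n"


if __name__ == "__main__":
    blot = make_inkblot(seed=2026)
    with open("stimulus_2026.pgm", "w") as fh:
        fh.write(to_pgm(blot))
```

Recording the seed alongside each model response keeps the stimulus reproducible for reviewers while still guaranteeing that the exact image did not exist before the experiment.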
In practice
- Verify training data overlap for evaluation stimuli.
- Use novel, unseen data for true LLM capability assessment.
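A crude first check for overlap, sketched here under stated assumptions, is to measure how much of a model's response consists of word n-grams that also appear in a reference set of published or commonly reported responses to the same stimulus. The function names and the reference strings below are hypothetical placeholders, not real normative Rorschach data. A high score is only suggestive of retrieval, and a low score does not rule out contamination, since a model can paraphrase memorized material.

```python
import re


def ngrams(text, n=2):
    """Return the set of lowercase word n-grams in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(response, reference_texts, n=2):
    """Fraction of the response's n-grams that occur in any reference text.

    Returns 0.0 when the response is too short to contain an n-gram.
    """
    resp = ngrams(response, n)
    if not resp:
        return 0.0
    ref = set().union(*(ngrams(t, n) for t in reference_texts))
    return len(resp & ref) / len(resp)


if __name__ == "__main__":
    # Placeholder strings for illustration only; not real normative data.
    references = ["Two bears dancing", "a butterfly or a bat"]
    response = "Two bears dancing around a fire"
    print(overlap_fraction(response, references))  # 2 of 5 bigrams match -> 0.4
```

Comparing this score for responses to well-known stimuli against responses to freshly generated ones gives a rough baseline: a large gap between the two is a warning sign that the well-known stimuli are eliciting recalled text.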
Topics
- Rorschach Test
- LLM Evaluation
- Data Contamination
- Methodological Validity
- Multimodal LLMs
Best for: AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.