ImplicitMemBench: Measuring Unconscious Behavioral Adaptation in Large Language Models

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems) · Depth: Expert, extended

Summary

ImplicitMemBench is the first systematic benchmark designed to evaluate implicit memory in large language models (LLMs), focusing on unconscious behavioral adaptation rather than explicit recall. Developed by researchers from The University of Hong Kong and Harbin Institute of Technology, this 300-item suite assesses three cognitively grounded constructs: Procedural Memory (one-shot skill acquisition after interference), Priming (theme-driven bias), and Classical Conditioning (CS–US associations shaping first decisions). The benchmark employs a unified Learning/Priming–Interfere–Test protocol with first-attempt scoring. Evaluation of 17 models, including DeepSeek-R1, Qwen3-32B, and GPT-5, revealed severe limitations, with no model exceeding 66% overall accuracy, significantly below human baselines. Analysis highlighted dramatic asymmetries, such as inhibition tasks achieving only 17.6% accuracy versus preference tasks at 75.0%, and identified universal bottlenecks requiring architectural innovations beyond mere parameter scaling.

Key takeaway

If you are a research scientist developing next-generation LLM agents, prioritize architectural innovations that specifically target implicit memory mechanisms. Current models struggle to consolidate experiences into automated behaviors, particularly in tasks requiring inhibition or subtle contextual adaptation. Your efforts should go beyond scaling parameters or augmenting explicit memory and focus on fundamental changes that enable unconscious learning and robust, automatic responses to learned patterns.

Key insights

Current LLMs show severe deficits in implicit memory: they struggle with automated behavioral adaptation and unconscious learning.

Principles

Method

ImplicitMemBench uses a three-phase Learning/Priming–Interfere–Test protocol with first-attempt scoring to isolate automatized behavior. It operationalizes procedural memory, priming, and classical conditioning through text-based agentic scenarios.
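The three-phase protocol can be sketched as a small evaluation loop. This is a minimal illustration under stated assumptions, not the benchmark's actual harness: the `Item` fields, the `Model` callable type, and the helper names `run_item` and `accuracy_by_construct` are all hypothetical and do not come from the paper. The sketch shows the two ideas the method relies on: the learned behavior is separated from the test by unrelated interference turns, and only the model's first reply to the probe is scored.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption: a "model" is any callable mapping the transcript so far to a reply.
Model = Callable[[List[str]], str]


@dataclass
class Item:
    """One hypothetical benchmark item (field names are illustrative)."""
    construct: str                      # e.g. "procedural", "priming", "conditioning"
    learning: str                       # Learning/Priming phase text
    interference: List[str]             # unrelated distractor turns
    probe: str                          # Test phase prompt
    is_correct: Callable[[str], bool]   # checker applied to the first reply only


def run_item(model: Model, item: Item) -> bool:
    """Run Learning/Priming, then Interfere, then Test; score the first attempt."""
    transcript: List[str] = []
    for turn in [item.learning, *item.interference]:
        transcript.append(turn)
        transcript.append(model(transcript))  # replies are kept as context, not scored
    transcript.append(item.probe)
    first_reply = model(transcript)  # no retries and no feedback before scoring
    return item.is_correct(first_reply)


def accuracy_by_construct(model: Model, items: List[Item]) -> Dict[str, float]:
    """Aggregate first-attempt accuracy separately for each construct."""
    totals: Dict[str, int] = {}
    hits: Dict[str, int] = {}
    for item in items:
        totals[item.construct] = totals.get(item.construct, 0) + 1
        hits[item.construct] = hits.get(item.construct, 0) + int(run_item(model, item))
    return {name: hits[name] / totals[name] for name in totals}
```

Scoring only the first reply is the design choice that matters here: a model that recovers the rule after a hint or a second try would still count as a failure, which is what separates automatized behavior from explicit recall on demand.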

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.