Mechanistic Decoding of Cognitive Constructs in LLMs

· Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

The article presents a Cognitive Reverse-Engineering framework, built on Representation Engineering (RepE), for analyzing how Large Language Models (LLMs) process complex emotions such as social-comparison jealousy. The framework combines appraisal theory with subspace orthogonalization, regression-based weighting, and bidirectional causal steering to isolate and quantify two psychological antecedents: "Superiority of Comparison Person" and "Domain Self-Definitional Relevance." Experiments on eight LLMs from the Llama, Qwen, and Gemma families indicate that these models natively encode jealousy as a structured linear combination of the two factors. The internal representations align with human psychological constructs: Superiority acts as a foundational trigger, and Relevance acts as an intensity multiplier. The framework also shows potential for detecting and suppressing toxic emotional states in LLMs.

Key takeaway

For research scientists investigating LLM interpretability or AI safety, this framework offers a concrete method to reverse-engineer and intervene on complex emotional states. Consider applying the Cognitive Reverse-Engineering approach to other nuanced cognitive constructs to improve model transparency and control, particularly in multi-agent systems where emotional dynamics are critical.

Key insights

LLMs encode complex emotions like jealousy as structured linear combinations of psychological antecedents.

Principles

Method

The Cognitive Reverse-Engineering framework uses RepE, appraisal theory, subspace orthogonalization, regression-based weighting, and bidirectional causal steering to analyze emotional antecedents.
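
The steps named above can be sketched on synthetic data. This is a minimal illustration, not the authors' implementation. It assumes that concept directions are difference-of-means vectors, that orthogonalization is a single Gram-Schmidt step, that weighting is ordinary least squares on projections, and that steering adds a scaled unit direction to a hidden state. All function names, dimensions, and the synthetic jealousy score are hypothetical.

```python
import numpy as np


def concept_direction(pos, neg):
    """Unit-norm difference-of-means direction between two activation sets."""
    d = pos.mean(axis=0) - neg.mean(axis=0)
    return d / np.linalg.norm(d)


def orthogonalize(v, basis):
    """Remove the component of v along the unit vector `basis`, then renormalize."""
    r = v - (v @ basis) * basis
    return r / np.linalg.norm(r)


def regression_weights(acts, directions, target):
    """Least-squares weights (one per direction, plus an intercept) for `target`."""
    cols = [acts @ d for d in directions] + [np.ones(len(acts))]
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef


def steer(hidden, direction, alpha):
    """Shift a hidden state along a unit direction; the sign of alpha sets the polarity."""
    return hidden + alpha * direction


# Synthetic stand-in for residual-stream activations (assumption: no real model is used).
rng = np.random.default_rng(0)
n, dim = 200, 64
u_s = rng.normal(size=dim)                # hypothetical latent "Superiority" axis
u_r = rng.normal(size=dim) + 0.5 * u_s    # hypothetical "Relevance" axis, correlated with u_s
s = rng.integers(0, 2, size=n)            # binary label: superiority present or absent
r = rng.integers(0, 2, size=n)            # binary label: relevance present or absent
acts = s[:, None] * u_s + r[:, None] * u_r + 0.1 * rng.normal(size=(n, dim))
score = 0.7 * s + 0.3 * r + 0.05 * rng.normal(size=n)  # made-up jealousy rating

# 1. Extract one direction per antecedent, then orthogonalize the second against the first.
v_s = concept_direction(acts[s == 1], acts[s == 0])
v_r = orthogonalize(concept_direction(acts[r == 1], acts[r == 0]), v_s)

# 2. Regress the score on the projections to estimate each factor's weight.
w_s, w_r, intercept = regression_weights(acts, [v_s, v_r], score)
print("weights:", w_s, w_r, "intercept:", intercept)

# 3. Steer one hidden state in both directions along the Superiority axis.
h = acts[0]
print("projection shift up:  ", (steer(h, v_s, 2.0) - h) @ v_s)
print("projection shift down:", (steer(h, v_s, -2.0) - h) @ v_s)
```

In a real setting, `acts` would be hidden states collected from a chosen layer of the model on contrastive prompts, and steering would be applied during the forward pass rather than to a stored vector.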

In practice

Topics

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.