My research: a computational cognitive neuroscience perspective on alignment

2026-06-05 · Source: AI Alignment Forum · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Research Methodology & Innovation · Depth: Expert, extended

Summary

Seth Herd's research agenda, spanning three years of full-time work, focuses on predicting the mechanistic details and alignment failure modes of the first transformative AI, which he terms "takeover-capable AI" (TCAI). Drawing on 23 years in computational cognitive neuroscience, Herd posits that TCAI will likely be advanced LLMs augmented with human-like cognitive capacities such as persistent memory, metacognition for error detection, and executive function for planning. He argues that current alignment techniques like RLHF and constitutional AI will behave differently once these capacities are added than they do in today's LLMs. His work also explores societal factors that influence AI safety, including government control, polarization of public opinion, and the effects of motivated reasoning and confirmation bias within the AI development community. Herd further analyzes candidate alignment targets, contrasting instruction-following (corrigibility) with value alignment, and emphasizes the "alignment stability problem" inherent in continuously learning systems.

Key takeaway

AI scientists and research scientists developing advanced LLMs should expect future AGI to have human-like cognitive capacities. Adding persistent memory, metacognition, and executive function will fundamentally alter alignment dynamics, so alignment strategies will need to go beyond current RLHF or constitutional AI. Prioritize designing for alignment stability in continuously learning systems, and explore internal review mechanisms to mitigate emergent risks.

Key insights

The first takeover-capable AI will likely be an augmented LLM, so alignment strategies must account for its added human-like cognitive capacities.

Principles

Predicting TCAI properties enables efficient alignment interventions.
Added cognitive capacities create new alignment challenges.
Alignment stability is crucial for continuously learning systems.

Method

The research takes an integrative, secondary-research approach: it combines extensive empirical work and theory from computational cognitive neuroscience to predict TCAI's architecture, the alignment techniques likely to be applied to it, and the points where they may fail.

In practice

Consider internal independent review for agentic LLM scaffolds.
Explore prompting LLMs as a component of alignment.
Improve LLM metacognition to reduce "slop" and aid epistemics.
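
One way to picture the first suggestion above, internal independent review in an agentic scaffold, is a loop in which a separate reviewer must approve each proposed action before it runs. The sketch below is illustrative only and is not taken from Herd's work: `Review`, `run_step`, `toy_planner`, and `toy_reviewer` are hypothetical names, and the planner and reviewer are plain Python stubs standing in for LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Review:
    approved: bool
    reason: str


# Hypothetical interfaces: a planner maps a prompt to a proposed action,
# and a reviewer independently judges that action against the goal.
Planner = Callable[[str], str]
Reviewer = Callable[[str, str], Review]


def run_step(goal: str, planner: Planner, reviewer: Reviewer,
             max_attempts: int = 3) -> Optional[str]:
    """Return a proposed action only if the independent reviewer approves it."""
    feedback = ""
    for _ in range(max_attempts):
        proposal = planner(goal + feedback)
        verdict = reviewer(goal, proposal)
        if verdict.approved:
            return proposal
        # Feed the rejection reason back so the planner can revise.
        feedback = f"\nPrevious proposal rejected: {verdict.reason}"
    return None  # no approved action; a real scaffold might escalate to a human


# Toy stand-ins for model calls (assumptions, for illustration only).
def toy_planner(prompt: str) -> str:
    return "summarize logs" if "rejected" in prompt else "delete logs"


def toy_reviewer(goal: str, proposal: str) -> Review:
    if "delete" in proposal:
        return Review(False, "destructive action")
    return Review(True, "ok")


action = run_step("tidy the log directory", toy_planner, toy_reviewer)
```

In this toy run the first proposal is rejected and the revised one is approved. The point of the design is that the reviewer sees only the goal and the proposal, not the planner's reasoning, so it acts as an independent check.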

Topics

AI Alignment
Transformative AI
Computational Cognitive Neuroscience
LLM Cognitive Architectures
Alignment Stability
Metacognition

Best for: AI Scientist, Research Scientist, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by AI Alignment Forum.