My research: a computational cognitive neuroscience perspective on alignment

Source: AI Alignment Forum · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Research Methodology & Innovation) · Depth: Expert, extended

Summary

Seth Herd's research agenda, spanning three years of full-time work, focuses on predicting the mechanistic details and alignment failure modes of the first transformative AI, termed "takeover-capable AI" (TCAI). Drawing on 23 years in computational cognitive neuroscience, Herd posits that TCAI will likely be an advanced LLM augmented with human-like cognitive capacities such as persistent memory, metacognition for error detection, and executive function for planning. He argues that current alignment techniques such as RLHF and constitutional AI will interact with these emergent capacities in ways that differ from their effects on today's LLMs. His work also explores societal factors influencing AI safety, including government control, polarization of public opinion, and the impact of motivated reasoning and confirmation bias within the AI development community. Herd further analyzes candidate alignment targets, contrasting instruction-following (corrigibility) with value alignment, and emphasizes the "alignment stability problem" inherent in continuously learning systems.

Key takeaway

For AI Scientists and Research Scientists developing advanced LLMs, understanding the likely emergence of human-like cognitive capacities in future AGI is critical. You should anticipate that adding persistent memory, metacognition, and executive function will fundamentally alter alignment dynamics, necessitating robust strategies beyond current RLHF or constitutional AI. Prioritize designing for alignment stability in continuously learning systems and explore internal review mechanisms to mitigate emergent risks.

Key insights

The first takeover-capable AI will likely be an augmented LLM, requiring alignment strategies that account for emergent human-like cognitive capacities.

Method

The research uses an integrative secondary research approach, combining extensive empirical work and theory from computational cognitive neuroscience to predict TCAI architecture, alignment techniques, and failure points.

Best for: AI Scientist, Research Scientist, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by AI Alignment Forum.