My research: a computational cognitive neuroscience perspective on alignment
Summary
Seth Herd's research agenda, spanning three years of full-time work, focuses on predicting the mechanistic details and alignment failure modes of the first transformative AI, termed "takeover-capable AI" (TCAI). Drawing on 23 years in computational cognitive neuroscience, Herd posits that TCAI will likely be an advanced LLM augmented with human-like cognitive capacities such as persistent memory, metacognition for error detection, and executive function for planning. He argues that current alignment techniques like RLHF and constitutional AI will interact with these added capacities differently than they do with today's LLMs. His work also explores societal factors influencing AI safety, including government control, polarization of public opinion, and the effects of motivated reasoning and confirmation bias within the AI development community. Herd further analyzes critical alignment targets, contrasting instruction-following (corrigibility) with value alignment, and emphasizes the "alignment stability problem" inherent in continuously learning systems.
Key takeaway
For AI Scientists and Research Scientists developing advanced LLMs, it is critical to understand which human-like cognitive capacities future AGI is likely to gain. Anticipate that adding persistent memory, metacognition, and executive function will fundamentally alter alignment dynamics and demand strategies that go beyond current RLHF or constitutional AI. Prioritize designing for alignment stability in continuously learning systems, and explore internal review mechanisms to mitigate the risks these capacities introduce.
Key insights
The first takeover-capable AI will likely be an augmented LLM, so alignment strategies must account for the human-like cognitive capacities added to it.
Principles
- Predicting TCAI properties enables efficient alignment interventions.
- Added cognitive capacities create new alignment challenges.
- Alignment stability is crucial for continuously learning systems.
Method
The research uses an integrative secondary research approach, combining extensive empirical work and theory from computational cognitive neuroscience to predict TCAI architecture, alignment techniques, and failure points.
In practice
- Consider internal independent review for agentic LLM scaffolds.
- Explore prompting LLMs as a component of alignment.
- Improve LLM metacognition to reduce "slop" and aid epistemics.
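To make the first suggestion above concrete, here is a minimal sketch of what an internal independent review step in an agentic scaffold might look like. It is an illustration only, not Herd's design: the names `Verdict`, `keyword_reviewer`, and `run_with_review` are hypothetical, and the keyword check merely stands in for what would presumably be a separate LLM call with its own prompt and context.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Verdict:
    approved: bool
    reason: str


# Hypothetical type: in a real scaffold the reviewer would likely be a
# separate LLM call, independent of the agent that proposed the action.
Reviewer = Callable[[str], Verdict]


def keyword_reviewer(action: str) -> Verdict:
    """Toy stand-in reviewer: rejects actions containing risky keywords."""
    risky = ("delete", "exfiltrate", "disable oversight")
    for word in risky:
        if word in action.lower():
            return Verdict(False, f"contains '{word}'")
    return Verdict(True, "no risky keyword found")


def run_with_review(proposed_actions: List[str], reviewer: Reviewer) -> List[str]:
    """Pass each proposed action to the reviewer; keep only approved ones."""
    executed = []
    for action in proposed_actions:
        verdict = reviewer(action)
        if verdict.approved:
            executed.append(action)  # a real scaffold would act here
    return executed


if __name__ == "__main__":
    plan = ["summarize the report", "Delete the audit logs", "email the summary"]
    print(run_with_review(plan, keyword_reviewer))
```

The point of the structure is that the reviewer sees each proposed action before it runs and can veto it, regardless of how the planning component arrived at that action.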
Topics
- AI Alignment
- Transformative AI
- Computational Cognitive Neuroscience
- LLM Cognitive Architectures
- Alignment Stability
- Metacognition
Best for: AI Scientist, Research Scientist, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by AI Alignment Forum.