How do we (more) safely defer to AIs?

2024-06-17 · Source: Redwood Research blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, AI Alignment · Depth: Expert, extended

Summary

This analysis explores strategies for safely deferring to advanced AI systems, particularly in time-constrained scenarios. It posits that as AI capabilities grow, full or near-full deference to AIs for managing risks, developing successor AIs, and making strategic decisions becomes increasingly necessary and unavoidable. The core objective is to ensure that these AIs are not malicious and are sufficiently aligned and effective, possessing "wisdom" and strong epistemics, especially in domains with poor feedback. The concept of a "Basin of Good Deference" (BGD) is introduced, where initial AIs improve their own alignment and wisdom to ensure future systems are even better. The discussion emphasizes the need to defer to AIs only slightly above the minimum capability required for automating safety research, implementation, and strategy, to mitigate risks like scheming and control loss. The article outlines high-level objectives for successful deference, including maintaining alignment, handling exogenous risks, and making sound strategic choices, while acknowledging the significant risks of rushed deference.

Key takeaway

If you are a research scientist developing advanced AI, prioritize building robust behavioral tests that specifically assess alignment and wisdom in hard-to-check, open-ended tasks. Focus on ensuring AIs are corrigible and non-scheming at the minimum viable capability level for automating safety, as this reduces overall risk and allows for iterative self-improvement of alignment within the "Basin of Good Deference." Aim to understand and mitigate overfitting to tests so that results generalize to real-world use.

Key insights

Safely deferring to advanced AI requires aligned, wise, and competent systems, ideally at minimal necessary capability.

Principles

Defer to AIs only slightly above minimum capability.
AIs must be corrigible and not scheme against humans.
Bootstrapping alignment and wisdom is key for future AI generations.

Method

The strategy has three parts: avoid behavioral tests that give misleading results, build robust tests for capabilities and alignment, and iterate until performance is good without overfitting to those tests. It focuses on prosaic ML research approaches.
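One common way to notice overfitting while iterating against tests is to keep a held-out subset that is never used for iteration and compare scores across the two subsets. The sketch below illustrates that idea only; it is not taken from the original article. The function names, the 0-to-1 score scale, and the 0.1 gap threshold are all hypothetical assumptions.

```python
import random
from statistics import mean


def split_tests(test_ids, holdout_fraction=0.3, seed=0):
    """Split test IDs into a dev set (iterated against) and a held-out set."""
    ids = sorted(test_ids)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)
    n_holdout = max(1, int(len(ids) * holdout_fraction))
    return ids[n_holdout:], ids[:n_holdout]


def overfitting_gap(scores, dev_ids, holdout_ids):
    """Mean dev score minus mean held-out score (scores assumed in [0, 1])."""
    dev_mean = mean(scores[i] for i in dev_ids)
    holdout_mean = mean(scores[i] for i in holdout_ids)
    return dev_mean - holdout_mean


def flag_overfitting(scores, dev_ids, holdout_ids, max_gap=0.1):
    """Return True when dev scores exceed held-out scores by more than max_gap.

    max_gap is a hypothetical threshold chosen for illustration.
    """
    return overfitting_gap(scores, dev_ids, holdout_ids) > max_gap
```

A large positive gap suggests that iteration has tuned the system to the dev tests rather than to the behavior they are meant to measure. A small gap does not by itself show that the tests generalize to real-world conditions.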

In practice

Develop specialized environments for alignment training.
Train AIs on tasks directly relevant to post-deference operations.
Study AI psychology to understand how AI propensities evolve.

Topics

AI Deference
AI Alignment
AI Safety
AI Capabilities
Behavioral Testing

Best for: Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Redwood Research blog.