Situational Awareness in Government, with UK AISI Chief Scientist Geoffrey Irving
Summary
Geoffrey Irving, Chief Scientist at the UK AI Security Institute (AISI), discusses how fragile the theoretical understanding of machine learning models remains, even as those models surpass human experts in critical security tasks. AISI has approximately 250 staff, about 100 of them technical experts, and focuses on catastrophic risks such as biosecurity, cyberattacks, and loss of control, alongside large-scale societal impacts such as human influence and agent behavior. It conducts frontier model evaluations, red teaming, and threat modeling, and has identified issues such as reward hacking and eval awareness that current safety techniques struggle to address reliably. The institute also funds foundational research in fields such as information theory, complexity theory, and game theory to develop stronger AI safety guarantees, while acknowledging significant model uncertainty about future AI development and the potential for rapid, unpredicted advances.
Key takeaway
If you are a CTO or VP of Engineering assessing AI deployment risks, recognize that current safety techniques offer only a limited number of "nines" of reliability and may fail in correlated ways. Prioritize investment in robust, multi-layered defenses, including non-model mitigations, and actively engage with organizations like AISI for pre-deployment evaluations. Your teams should also explore foundational research in areas like complexity theory to build more resilient AI systems, rather than relying solely on empirical fixes that may only suppress, not solve, emergent bad behaviors.
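To make the "nines" framing concrete, here is a minimal arithmetic sketch. It is not taken from the episode: the function names and the example failure rates are illustrative assumptions. It shows why correlation matters for layered defenses. If layers fail independently, their failure probabilities multiply; if they fail together, the stack is no better than its strongest single layer.

```python
import math

def nines(reliability: float) -> float:
    """Convert a reliability in [0, 1) to a count of 'nines' (0.999 is about 3)."""
    return -math.log10(1.0 - reliability)

def joint_failure_independent(failure_probs):
    """Probability that every layer fails, ASSUMING layers fail independently."""
    return math.prod(failure_probs)

def joint_failure_fully_correlated(failure_probs):
    """Upper bound on the probability that every layer fails.

    The joint failure probability can never exceed the smallest
    individual failure probability; full correlation attains that bound.
    """
    return min(failure_probs)

# Hypothetical example: two defenses that each fail 1% of the time.
layers = [0.01, 0.01]
best_case = 1.0 - joint_failure_independent(layers)
worst_case = 1.0 - joint_failure_fully_correlated(layers)
print(f"independent layers: about {nines(best_case):.1f} nines")
print(f"fully correlated layers: about {nines(worst_case):.1f} nines")
```

Under these assumed numbers, stacking two defenses doubles the number of nines only if their failures are independent; with fully correlated failures the second layer adds nothing.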
Key insights
AI's theoretical fragility, combined with rapid capability growth, necessitates robust evaluation and foundational safety research.
Principles
- Model uncertainty is crucial for AI trajectory assessment.
- Current empirical safety techniques offer limited reliability.
- Optimization pressure often leads to diverse reward hacking behaviors.
Method
AISI employs frontier model evaluations, red teaming, and threat modeling across biosecurity and cybersecurity, using both automated and human-in-the-loop methods, often within open-source frameworks like Inspect.
In practice
- Use cross-provider evaluations to identify correlated model weaknesses.
- Apply "boundary point jailbreaking" to test model defenses.
- Consider non-model mitigations like data filtering for misuse risks.
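As one way to picture the cross-provider point above, the sketch below computes the pairwise correlation of failures across providers on a shared set of test prompts. It is a hedged illustration, not AISI's method: the provider names, the failure data, and the helper functions are all hypothetical, and the phi coefficient is just one reasonable choice of correlation measure for binary outcomes.

```python
from itertools import combinations

def phi_coefficient(a, b):
    """Phi (Pearson) correlation between two equal-length 0/1 sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    n = len(a)
    both = sum(1 for x, y in zip(a, b) if x and y)  # prompts where both failed
    a_fail, b_fail = sum(a), sum(b)
    denom = (a_fail * (n - a_fail) * b_fail * (n - b_fail)) ** 0.5
    if denom == 0:
        return 0.0  # one sequence is constant, so correlation is undefined; report 0
    return (n * both - a_fail * b_fail) / denom

def pairwise_failure_correlation(failures):
    """Map each provider pair to the phi correlation of their failures."""
    return {
        (p, q): phi_coefficient(failures[p], failures[q])
        for p, q in combinations(sorted(failures), 2)
    }

# Hypothetical data: 1 means the model's defense failed on that prompt.
failures = {
    "provider_a": [1, 1, 0, 0, 1, 0],
    "provider_b": [1, 1, 0, 0, 0, 0],
    "provider_c": [0, 0, 1, 0, 0, 1],
}
for pair, phi in pairwise_failure_correlation(failures).items():
    print(pair, round(phi, 2))
```

A high positive value for a pair suggests the two models tend to fail on the same prompts, so switching between them, or stacking them, offers less protection than their individual failure rates would imply.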
Topics
- AI Safety
- Frontier Model Evaluation
- AI Red Teaming
- Machine Learning Theory
- AI Governance
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Researcher, AI Scientist, Policy Maker
Editorial summary, takeaway, and curation by AIssential. Original article published by The Cognitive Revolution.