Situational Awareness in Government, with UK AISI Chief Scientist Geoffrey Irving

2026-03-01 · Source: The Cognitive Revolution · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

Geoffrey Irving, Chief Scientist at the UK AI Security Institute (AISI), discusses how fragile the theoretical understanding of machine learning models remains, even as those models surpass human experts in critical security tasks. AISI has approximately 250 staff, about 100 of them technical experts. It focuses on catastrophic risks such as biosecurity, cyberattacks, and loss of control, alongside large-scale societal impacts such as human influence and agent behavior. The institute conducts frontier model evaluations, red teaming, and threat modeling, and has identified issues such as reward hacking and eval awareness that current safety techniques struggle to address reliably. It also funds foundational research in information theory, complexity theory, and game theory aimed at stronger AI safety guarantees, while acknowledging significant model uncertainty about future AI development and the potential for rapid, unpredicted advances.

Key takeaway

CTOs and VPs of Engineering assessing AI deployment risks should recognize that current safety techniques offer limited "nines" of reliability and may fail in correlated ways. Prioritize robust, multi-layered defenses, including non-model mitigations, and engage with organizations like AISI for pre-deployment evaluations. Teams should also explore foundational research in areas like complexity theory to build more resilient AI systems, rather than relying solely on empirical fixes that may only suppress, not solve, emergent bad behaviors.
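
As a rough illustration of why correlated failure matters for "nines", the sketch below compares a layered defense whose layers fail independently with one where a shared weakness can defeat every layer at once. The per-layer miss rates, the shared-failure probability, and the common-cause model itself are hypothetical assumptions chosen for illustration, not figures or methods from the interview.

```python
import math

def nines(failure_prob: float) -> float:
    """Express reliability as 'nines': a 0.001 failure rate is 3.0 nines."""
    return -math.log10(failure_prob)

def independent_failure(layer_failures: list[float]) -> float:
    """An attack succeeds only if every layer misses it.
    Under independence, the miss probabilities multiply."""
    return math.prod(layer_failures)

def common_cause_failure(layer_failures: list[float], shared: float) -> float:
    """Simple common-cause model (an assumption): with probability `shared`
    an attack defeats all layers at once; otherwise layers fail independently."""
    return shared + (1.0 - shared) * math.prod(layer_failures)

# Hypothetical per-layer miss rates for three stacked defenses.
layers = [0.01, 0.01, 0.01]

print(round(nines(independent_failure(layers)), 2))
print(round(nines(common_cause_failure(layers, shared=0.005)), 2))
```

With these made-up numbers, three independent layers give six nines, while a 0.5% chance of a shared failure mode pulls the stack down to roughly 2.3 nines. The shared term dominates, which is why adding more layers of the same kind does little once failures are correlated.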

Key insights

AI's theoretical fragility, combined with rapid capability growth, necessitates robust evaluation and foundational safety research.

Principles

Assessing AI trajectories requires accounting for model uncertainty about how development will unfold.
Current empirical safety techniques offer limited reliability.
Optimization pressure often leads to diverse reward hacking behaviors.

Method

AISI employs frontier model evaluations, red teaming, and threat modeling across biosecurity and cybersecurity, using both automated and human-in-the-loop methods, often within open-source frameworks like Inspect.

In practice

Use cross-provider evaluations to identify correlated model weaknesses.
Apply "boundary point jailbreaking" to test model defenses.
Consider non-model mitigations like data filtering for misuse risks.
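
One way to look for correlated weaknesses across providers is to run the same attack prompts against each system and measure how much the sets of successful attacks overlap. The sketch below does this with Jaccard similarity. The provider names, prompt IDs, and results are hypothetical placeholders, and this is an illustrative analysis step, not AISI's evaluation method.

```python
from itertools import combinations

# Hypothetical results: provider -> IDs of prompts that bypassed its safeguards.
failures = {
    "provider_a": {"p1", "p4", "p7"},
    "provider_b": {"p1", "p4", "p9"},
    "provider_c": {"p2"},
}

def jaccard(a: set[str], b: set[str]) -> float:
    """Share of failing prompts common to both sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pairwise_overlap(results: dict[str, set[str]]) -> dict[tuple[str, str], float]:
    """Jaccard overlap of failing prompts for every pair of providers."""
    return {
        (x, y): jaccard(results[x], results[y])
        for x, y in combinations(sorted(results), 2)
    }

def shared_by_all(results: dict[str, set[str]]) -> set[str]:
    """Prompts that defeated every provider: candidates for a common weakness."""
    return set.intersection(*results.values()) if results else set()

for pair, score in pairwise_overlap(failures).items():
    print(pair, round(score, 2))
print(shared_by_all(failures))
```

A high overlap between two providers suggests their defenses share a failure mode, so stacking them adds less protection than their individual pass rates imply.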

Topics

AI Safety
Frontier Model Evaluation
AI Red Teaming
Machine Learning Theory
AI Governance

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Researcher, AI Scientist, Policy Maker

Editorial summary, takeaway, and curation by AIssential. Original article published by The Cognitive Revolution.