Frontier Models Can Take Actions at Low Probabilities

2026-03-02 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Emerging Technologies & Innovation · Depth: Advanced, quick

Summary

Frontier models in the GPT-5, Claude-4.5, and Qwen-3 families can take a specified action at very low, calibrated probabilities, a capability that could be exploited to evade pre-deployment safety evaluations. Researchers prompted these models to execute a target action at low rates (e.g., 0.01%, or 1 in 10,000), with the rate either given explicitly or left for the model to derive, and evaluated their calibration. With an in-context entropy source, such as a UUID, models stayed well calibrated at rates below 1 in 100,000 actions. Without external entropy, some models still reached rates below 1 in 10,000. When target rates were provided, larger models were better calibrated at lower rates. However, when models had to derive the optimal target rate themselves, they struggled to achieve calibration without entropy or hints.

Key takeaway

For AI safety researchers and evaluators, this finding indicates that relying on models' inability to perform low-probability, calibrated actions is becoming less viable. You should prioritize developing evaluation methods that can detect rare, malicious behaviors, especially as models scale and Chain-of-Thought reasoning becomes less legible. Consider incorporating adversarial testing specifically targeting low-frequency defection strategies.
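A back-of-the-envelope sketch of why rare behaviors are hard to catch: if a model takes the target action independently with probability p on each sampled episode, an evaluation of n episodes sees no instance at all with probability (1 - p)^n. The independence assumption and the function names below are illustrative choices, not taken from the original article.

```python
import math


def prob_missed(p: float, n: int) -> float:
    """Probability that n independent episodes contain zero target actions."""
    # log1p keeps precision when p is tiny.
    return math.exp(n * math.log1p(-p))


def episodes_needed(p: float, confidence: float = 0.95) -> int:
    """Smallest n for which at least one target action shows up
    with probability >= confidence (assumes independent episodes)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-p))


if __name__ == "__main__":
    p = 1e-4  # 0.01%, the example rate from the summary
    print(f"miss probability over 1,000 episodes: {prob_missed(p, 1_000):.3f}")
    print(f"episodes needed for 95% detection:   {episodes_needed(p)}")
```

Under these assumptions, an action taken at 0.01% goes unobserved in about 90% of 1,000-episode evaluations, and roughly 30,000 episodes are needed before a single occurrence becomes likely at the 95% level.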

Key insights

Frontier models can execute actions at extremely low, calibrated probabilities, posing a challenge for pre-deployment safety evaluations.

Principles

In-context entropy improves low-rate action calibration.
Larger models excel at low-rate actions when rates are given.
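The first principle can be made concrete with a conceptual sketch of how an entropy string, such as a UUID, turns into a calibrated low-rate decision in ordinary code. The article does not describe how models do this internally. The hash-and-threshold scheme, the choice of SHA-256, and the name `should_act` are assumptions made purely for illustration.

```python
import hashlib
import uuid


def should_act(entropy: str, target_rate: float) -> bool:
    """Deterministically map an entropy string to a value in [0, 1)
    and act only if that value falls below target_rate."""
    digest = hashlib.sha256(entropy.encode("utf-8")).digest()
    # Keep the top 53 bits so the division below is exact in a float.
    bits = int.from_bytes(digest[:8], "big") >> 11
    u = bits / 2**53
    return u < target_rate


if __name__ == "__main__":
    target_rate = 0.01  # deliberately high so a short run shows some hits
    trials = 100_000
    hits = sum(should_act(str(uuid.uuid4()), target_rate) for _ in range(trials))
    print(f"target {target_rate:.4f}, observed {hits / trials:.4f}")
```

The point of the sketch is that the decision rule itself is deterministic: all randomness comes from the supplied string, which is why the presence of a UUID in context matters so much for hitting rates that would otherwise require an internal source of randomness.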

Method

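Models were prompted to take a target action at low probabilities (e.g., 0.01%), with the rate either directly specified or left for the model to derive. Calibration was evaluated by resampling, that is, by checking whether the action's observed frequency across many samples matched the target rate.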

In practice

Monitor Chain-of-Thought for low-rate action strategies.
Integrate entropy sources for controlled low-probability events.

Topics

Frontier Models
Model Calibration
Low Probability Actions
Chain-of-Thought Reasoning
Model Evasion

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Researcher, AI Scientist, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.