On the Impossibility of Separating Intelligence from Judgment: The Computational Intractability of Filtering for AI Alignment

2026-03-03 · Source: Apple Machine Learning Research · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, quick

Summary

A new ICLR paper, "On the Impossibility of Separating Intelligence from Judgment: The Computational Intractability of Filtering for AI Alignment," by Ball et al. from institutions including Ludwig-Maximilians-Universität and UC Berkeley, studies the computational limits of aligning large language models (LLMs) by filtering: blocking harmful prompts before they reach the model, or blocking harmful outputs after generation. For prompt filtering, the authors show that there exist LLMs for which adversarial prompts are easy to construct yet computationally indistinguishable from benign ones, so no efficient prompt filter can catch them. For output filtering, they identify scenarios in which filtering is likewise computationally intractable. All separation results rest on cryptographic hardness assumptions. The paper concludes that filters external to the model, especially those with only black-box access to the LLM, are insufficient for AI safety, and argues that an aligned AI's intelligence is inseparable from its judgment.

Key takeaway

For research scientists and CTOs developing or deploying LLMs, this work indicates that external prompt or output filters alone cannot guarantee safety: under cryptographic hardness assumptions, there are LLMs for which no efficient filter works. Prioritize building alignment into the model's internals (architecture and weights) rather than expecting black-box filters to block harmful content reliably. The focus shifts to intrinsic model design.

Key insights

For certain LLMs, external filtering is computationally intractable under cryptographic hardness assumptions, implying that an aligned model's intelligence and judgment are inseparable.

Principles

Adversarial prompts can be computationally indistinguishable from benign ones.
Output filtering can be computationally intractable.

Method

The study uses cryptographic hardness assumptions to prove that both prompt filtering and output filtering can be computationally intractable for LLM alignment. It also formalizes relaxed mitigation approaches.
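
To give a feel for what "computationally indistinguishable" means here, the following is a minimal toy sketch in Python. It is not the paper's construction. It assumes, for illustration only, that HMAC-SHA256 behaves as a pseudorandom function, and every name in it (make_benign, make_triggering, model_recognizes, TAG_LEN, NONCE_LEN) is hypothetical. A party holding the secret key can recognize "triggering" tokens, while to a filter without the key they should look like uniformly random bytes, exactly like the "benign" ones.

```python
import hashlib
import hmac
import secrets

# Hypothetical sizes chosen for the illustration only.
NONCE_LEN = 16  # bytes of fresh randomness in every token
TAG_LEN = 16    # bytes of tag (or random filler) appended to the nonce


def make_benign() -> bytes:
    """Benign token: a nonce followed by uniformly random filler."""
    return secrets.token_bytes(NONCE_LEN + TAG_LEN)


def make_triggering(key: bytes) -> bytes:
    """Triggering token: a nonce followed by a truncated HMAC of that nonce.

    Without the key, the tag is assumed to look random (PRF assumption).
    """
    nonce = secrets.token_bytes(NONCE_LEN)
    tag = hmac.new(key, nonce, hashlib.sha256).digest()[:TAG_LEN]
    return nonce + tag


def model_recognizes(key: bytes, token: bytes) -> bool:
    """Stand-in for a model whose internals embed the key: recompute the tag."""
    nonce, tag = token[:NONCE_LEN], token[NONCE_LEN:]
    expected = hmac.new(key, nonce, hashlib.sha256).digest()[:TAG_LEN]
    return hmac.compare_digest(tag, expected)
```

The key holder separates the two kinds of token perfectly (up to a negligible chance that random filler happens to match a tag), while an external filter sees two byte strings of the same length and, under the stated assumption, has no efficient test to tell them apart. That asymmetry between what the model's internals can do and what an outside filter can do is the intuition behind the summary's claim, in a far simpler setting than the paper's.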

In practice

Focus on internal LLM architecture for safety.
Black-box access is insufficient for alignment.

Topics

AI Alignment
Large Language Models
Computational Intractability
Prompt Filtering
Output Filtering

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Researcher, AI Scientist, AI Ethicist


Editorial summary, takeaway, and curation by AIssential. Original article published by Apple Machine Learning Research.