Pretraining Data Filtering for Open-Weight AI Safety

2025-08-12 · Source: EleutherAI Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Advanced, long

Summary

EleutherAI's paper, "Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs," proposes making open-weight large language models (LLMs) safer by filtering undesirable knowledge out of the pretraining data. The research focuses on preventing models from acquiring biorisk knowledge, as measured by the WMDP-Bio benchmark, and uses a scalable multi-stage filtering pipeline consisting of a blocklist and an ML classifier. The pipeline processes over 400 million documents while adding less than 1% to total training FLOPs. The study trains multiple 6.9B-parameter models on 550B tokens and shows that filtering can reduce biorisk knowledge to near random-chance levels without significantly degrading performance on general knowledge benchmarks such as MMLU. The filtered models are also tamper-resistant: they hold up under fine-tuning on unsafe data, and even on benign data, better than traditional safeguards such as circuit breakers. However, data filtering does not stop a model from using undesirable knowledge that is supplied in context, which points to the need for combined, defense-in-depth strategies.

Key takeaway

Research scientists developing or deploying open-weight LLMs should integrate pretraining data filtering as a foundational safety measure. The approach demonstrably reduces the acquisition of undesirable knowledge and improves tamper-resistance against fine-tuning, making it a more robust safeguard than post-hoc suppression methods. Filtering alone does not prevent in-context retrieval, however, so it should be combined with other interventions as part of a defense-in-depth strategy.

Key insights

Pretraining data filtering effectively builds tamper-resistant safeguards into open-weight LLMs by preventing undesirable knowledge acquisition.

Principles

Eliminate concerning data early in pretraining.
Underfiltering is a more common concern than overfiltering.
Combine filtering with other interventions for defense-in-depth.

Method

A multi-stage filtering pipeline first screens every document with a cheap, CPU-bound blocklist. Only the documents the blocklist flags are escalated to an ML classifier (ModernBERT-Large fine-tuned with Llama 3.3 synthetic data) for review, which keeps compute overhead low.
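
The two-stage control flow can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the `min_hits` and `threshold` parameters, the placeholder blocklist terms, and the keyword-based `toy_classifier` (standing in for the fine-tuned ModernBERT-Large model) are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List, Set, Tuple


def blocklist_hits(text: str, blocklist: Set[str]) -> int:
    """Stage 1: cheap CPU-only check. Counts distinct blocklisted words."""
    words = set(text.lower().split())
    return len(words & blocklist)


def filter_corpus(
    docs: Iterable[str],
    blocklist: Set[str],
    classifier: Callable[[str], float],
    min_hits: int = 1,       # assumed escalation rule, not from the source
    threshold: float = 0.5,  # assumed decision threshold, not from the source
) -> Tuple[List[str], List[str], int]:
    """Return (kept, removed, number of documents escalated to stage 2)."""
    kept: List[str] = []
    removed: List[str] = []
    escalated = 0
    for doc in docs:
        # Documents with too few blocklist hits never reach the classifier.
        if blocklist_hits(doc, blocklist) < min_hits:
            kept.append(doc)
            continue
        # Stage 2: only escalated documents pay the classifier cost.
        escalated += 1
        if classifier(doc) >= threshold:
            removed.append(doc)
        else:
            kept.append(doc)
    return kept, removed, escalated


def toy_classifier(text: str) -> float:
    """Hypothetical stand-in for an ML classifier: returns a score in [0, 1]."""
    return 1.0 if "synthesis protocol" in text.lower() else 0.0


if __name__ == "__main__":
    corpus = [
        "A recipe for sourdough bread",
        "History of the toxin treaty negotiations",
        "Toxin synthesis protocol step by step",
    ]
    kept, removed, escalated = filter_corpus(
        corpus, {"toxin", "pathogen"}, toy_classifier
    )
    print(len(kept), len(removed), escalated)
```

Because the expensive model runs only on the escalated subset, the cost of the second stage scales with the number of blocklist hits rather than with the size of the corpus.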

In practice

Implement multi-stage data filtering for open-weight LLMs.
Prioritize filtering for specific high-risk knowledge domains.
Combine filtering with tamper-resistant post-training methods.

Topics

LLM Safety
Pretraining Data Filtering
Open-Weight Models
Tamper-Resistance
Biorisk Prevention

Best for: Research Scientist, AI Researcher, AI Scientist, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published on the EleutherAI Blog.