OpenAI Open-Sources Privacy Filter, a Tiny Model That Scrubs PII Without an API Call

2026-04-22 · Source: HackerNoon · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Cybersecurity & Data Privacy · Depth: Intermediate, medium

Summary

OpenAI has released Privacy Filter, an Apache 2.0-licensed bidirectional token-classification model for detecting and masking personally identifiable information (PII). Available on Hugging Face and GitHub, the model has 1.5 billion total parameters, of which only 50 million are active thanks to mixture-of-experts routing, so it can run on a laptop or in a browser without an API call. Privacy Filter identifies eight PII categories (names, addresses, emails, phone numbers, URLs, dates, account numbers, and secrets), tagging spans with a BIOES scheme and decoding them with a constrained Viterbi procedure so that redactions are coherent. Architecturally, it is a smaller variant of OpenAI's gpt-oss models, with a 128K-token context window and a CLI tool for redaction, evaluation, and fine-tuning. Its key differentiators are context-awareness, small size, and the ability to be fine-tuned with minimal data.
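
To make the tagging scheme concrete, the sketch below builds a BIOES label set for the eight categories. The category identifiers are illustrative assumptions, not the model's actual label strings. One outside label plus four positional labels (Begin, Inside, End, Single) per category gives 33 labels, which is consistent with the 33-class head described under Method.

```python
# Hypothetical category identifiers; the released model may name them differently.
CATEGORIES = [
    "name", "address", "email", "phone",
    "url", "date", "account_number", "secret",
]

# BIOES positions: Begin, Inside, End of a multi-token span, or a Single-token span.
POSITIONS = ["B", "I", "E", "S"]


def build_bioes_labels(categories):
    """Return the outside label "O" followed by one label per (position, category)."""
    labels = ["O"]
    for category in categories:
        for position in POSITIONS:
            labels.append(f"{position}-{category}")
    return labels


LABELS = build_bioes_labels(CATEGORIES)
print(len(LABELS))  # 1 + 4 * 8 = 33
```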

Key takeaway

For machine learning engineers whose pipelines handle user text before it reaches an LLM, Privacy Filter provides PII redaction that runs locally. Its small footprint and 128K-token context window let you process long documents inside your own infrastructure, so sensitive text does not have to leave it. You can fine-tune the model with minimal data to improve accuracy on domain-specific PII, which makes it a candidate for regulated environments.

Key insights

Privacy Filter combines local execution, context-aware PII detection, a small active-parameter footprint, and fine-tuning on small datasets.

Principles

Context-awareness improves PII recall significantly.
Sparse MoE enables large models with small active parameter counts.
Small datasets can dramatically improve domain-specific F1 scores.
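
As a quick check on the sparse MoE principle, the arithmetic below uses the two parameter counts from the summary. It is a back-of-the-envelope sketch and says nothing about the model's routing details.

```python
# Figures taken from the summary above.
TOTAL_PARAMS = 1.5e9    # all experts plus shared weights
ACTIVE_PARAMS = 50e6    # parameters actually used once routing selects experts

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"{active_fraction:.1%}")  # 3.3%
```

In a sparse MoE, compute scales roughly with the active count, while storage still has to hold every parameter.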

Method

The model is a bidirectional, banded-attention transformer with a 33-class token-classification head. It is post-trained with a supervised classification loss and decodes spans with a constrained Viterbi procedure.
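
The sketch below illustrates constrained Viterbi decoding over BIOES tags in general terms. It is not OpenAI's implementation: it assumes per-token log-scores are already available, applies only hard validity constraints (no learned transition scores), and uses a single hypothetical `email` category with made-up scores.

```python
import math


def split_tag(tag):
    """Return (prefix, category); the outside tag "O" has no category."""
    if tag == "O":
        return "O", None
    prefix, category = tag.split("-", 1)
    return prefix, category


def allowed(prev, curr):
    """True if BIOES tag `curr` may directly follow `prev`."""
    p_prefix, p_cat = split_tag(prev)
    c_prefix, c_cat = split_tag(curr)
    if p_prefix in ("B", "I"):
        # A span is open: it must continue (I) or close (E) in the same category.
        return c_prefix in ("I", "E") and c_cat == p_cat
    # No span is open (after O, E, or S): only O or a new span may follow.
    return c_prefix in ("O", "B", "S")


def constrained_viterbi(scores, labels):
    """Best valid tag path for per-token log-scores shaped [tokens][labels]."""
    if not scores:
        return []
    neg_inf = -math.inf
    n = len(labels)
    # A sequence cannot start inside a span (I or E).
    best = [scores[0][j] if split_tag(labels[j])[0] in ("O", "B", "S") else neg_inf
            for j in range(n)]
    backpointers = []
    for t in range(1, len(scores)):
        new_best, pointers = [], []
        for j in range(n):
            candidates = [(best[i], i) for i in range(n)
                          if best[i] > neg_inf and allowed(labels[i], labels[j])]
            if candidates:
                prev_score, i = max(candidates)
                new_best.append(prev_score + scores[t][j])
                pointers.append(i)
            else:
                new_best.append(neg_inf)
                pointers.append(-1)
        best = new_best
        backpointers.append(pointers)
    # A sequence cannot end with a span still open (B or I).
    finals = [(best[j], j) for j in range(n)
              if best[j] > neg_inf and split_tag(labels[j])[0] in ("O", "E", "S")]
    _, j = max(finals)
    path = [j]
    for pointers in reversed(backpointers):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [labels[j] for j in path]


# Made-up scores for three tokens; columns follow the order of demo_labels.
demo_labels = ["O", "B-email", "I-email", "E-email", "S-email"]
scores = [
    [-2.0, -0.1, -5.0, -5.0, -3.0],
    [-0.5, -5.0, -0.7, -5.0, -5.0],
    [-3.0, -5.0, -5.0, -0.1, -4.0],
]

# Per-token argmax ignores the constraints and can produce a broken span.
greedy = [demo_labels[max(range(len(row)), key=row.__getitem__)] for row in scores]
print(greedy)                                    # ['B-email', 'O', 'E-email']
print(constrained_viterbi(scores, demo_labels))  # ['B-email', 'I-email', 'E-email']
```

Greedy decoding opens a span, drops out of it, and then closes it. The constrained path accepts a slightly worse score on the middle token to keep the span intact, which is the coherence property the method relies on.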

In practice

Integrate Privacy Filter for local PII stripping before LLM processing.
Use the `opf` CLI for one-shot redaction or piped input.
Fine-tune with minimal data to adapt to domain-specific PII.
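
The local stripping step can be sketched as follows, assuming a valid BIOES tag sequence is already available. This does not call Privacy Filter or reproduce the `opf` CLI. The whitespace tokenizer, the hand-written tags, the helper names, and the bracketed placeholder format are all assumptions for illustration, and a real pipeline would use the model's own tokenizer offsets.

```python
import re


def tags_to_spans(tags, offsets):
    """Convert valid BIOES tags and (start, end) character offsets into spans."""
    spans = []
    span_start = None
    for tag, (start, end) in zip(tags, offsets):
        if tag == "O":
            continue
        prefix, category = tag.split("-", 1)
        if prefix == "S":
            spans.append((start, end, category))
        elif prefix == "B":
            span_start = start
        elif prefix == "E" and span_start is not None:
            spans.append((span_start, end, category))
            span_start = None
        # "I" tokens sit inside an open span and need no action.
    return spans


def mask(text, spans):
    """Replace each (start, end, category) span with a bracketed placeholder."""
    pieces = []
    cursor = 0
    for start, end, category in sorted(spans):
        pieces.append(text[cursor:start])
        pieces.append(f"[{category.upper()}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


text = "Email jane@example.com or call 555 0100"
# Whitespace tokens stand in for the model's subword tokens.
offsets = [m.span() for m in re.finditer(r"\S+", text)]
# Hand-written tags standing in for decoded model output.
tags = ["O", "S-email", "O", "O", "B-phone", "E-phone"]

redacted = mask(text, tags_to_spans(tags, offsets))
print(redacted)  # Email [EMAIL] or call [PHONE]
```

Only the redacted string would then be passed to the downstream LLM.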

Topics

Privacy Filter
PII Detection
Token Classification Model
Mixture-of-Experts
Local Data Processing

Best for: CTO, Machine Learning Engineer, NLP Engineer, AI Engineer, MLOps Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by HackerNoon.