Redact or Keep? A Fully Local AI Cascade for Educational Dialogue De-Identification

· Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Data Science & Analytics · Depth: Expert, extended

Summary

A fully local AI cascade de-identifies educational dialogue, addressing the challenge of distinguishing personally identifiable information (PII) from curricular content without any data egress. The two-stage system first runs a high-recall candidate proposer that combines DeBERTa-v3-base and ModernBERT-base encoders with rule-based patterns to over-generate potential PII spans. The second stage is a contextual Redact/Keep reviewer, a LoRA-trained Gemma LLM, which makes a binary decision on each candidate based on the surrounding dialogue and speaker role. Evaluated on math tutoring transcripts from two platforms, the strongest configuration, Union + Gemma 31B, achieved a macro F1 of 0.958, compared with 0.767 for a same-family LLM-only baseline and 0.706 for a commercial API. On a targeted challenge set of ambiguous names, Union + Gemma 31B degraded by only 0.03 F1. The framework is deployable on a single laptop-class machine, requiring approximately 4.3 hours of training per seed and 18 GB of inference memory.

Key takeaway

NLP engineers tasked with de-identifying sensitive educational dialogue should prioritize a cascaded approach over single-pass LLM extraction. Separating high-recall candidate proposal from contextual Redact/Keep review achieves higher accuracy (0.958 macro F1) and holds up on ambiguous names, losing only 0.03 F1 on the targeted challenge set. The system can be deployed fully locally, avoiding data egress risks: the 31B Gemma reviewer requires only 18 GB of inference memory on a laptop-class machine. This strategy preserves data utility while supporting privacy compliance.

Key insights

For local educational de-identification, problem formulation matters more than model scale: separating PII candidate proposal from contextual Redact/Keep review outperforms single-pass extraction by a same-family LLM.

Method

A two-stage cascade: (1) the union of DeBERTa-v3-base, ModernBERT-base, and regex rules over-generates PII candidate spans; (2) a LoRA-trained Gemma LLM reviewer makes a binary Redact/Keep decision for each candidate, using the surrounding dialogue and speaker role as context.
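
The control flow of such a cascade can be illustrated with a minimal Python sketch. It is not the paper's implementation: the two proposers are toy regex stand-ins for the encoder taggers and rule patterns, the reviewer is a hand-written heuristic standing in for the LoRA-trained Gemma model, and every name here (`Span`, `union_candidates`, `stub_reviewer`, `deidentify`, the `[REDACTED]` placeholder) is hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    text: str


Proposer = Callable[[str], List[Span]]
Reviewer = Callable[[Span, str, str], str]  # returns "Redact" or "Keep"


def capitalized_proposer(utterance: str) -> List[Span]:
    # Toy high-recall rule: every capitalized word is a candidate.
    return [Span(m.start(), m.end(), m.group())
            for m in re.finditer(r"\b[A-Z][a-z]+\b", utterance)]


def lexicon_proposer(utterance: str) -> List[Span]:
    # Toy stand-in for an encoder token classifier (assumption, not a real model).
    return [Span(m.start(), m.end(), m.group())
            for m in re.finditer(r"\b(?:Maria|Euler)\b", utterance)]


def union_candidates(utterance: str, proposers: Sequence[Proposer]) -> List[Span]:
    # Stage 1: union of all proposers, with exact duplicates removed.
    seen = set()
    for propose in proposers:
        seen.update(propose(utterance))
    return sorted(seen, key=lambda s: (s.start, s.end))


NAME_CUES = ("my name is", "i am", "i'm", "thanks", "hi", "hello")


def stub_reviewer(span: Span, utterance: str, speaker_role: str) -> str:
    # Stage 2 stand-in: a real reviewer would be an LLM that reads the context.
    # Here a span is redacted only if a greeting or introduction cue precedes it.
    left = utterance[:span.start].lower().rstrip(" ,")
    return "Redact" if left.endswith(NAME_CUES) else "Keep"


def deidentify(utterance: str, speaker_role: str,
               proposers: Sequence[Proposer], reviewer: Reviewer) -> str:
    candidates = union_candidates(utterance, proposers)
    to_redact = [s for s in candidates
                 if reviewer(s, utterance, speaker_role) == "Redact"]
    out = utterance
    boundary = len(utterance) + 1
    # Replace right to left so earlier offsets stay valid; skip overlapping spans.
    for s in sorted(to_redact, key=lambda s: s.start, reverse=True):
        if s.end <= boundary:
            out = out[:s.start] + "[REDACTED]" + out[s.end:]
            boundary = s.start
    return out


if __name__ == "__main__":
    line = "Hi Maria, can you apply the Pythagorean theorem?"
    print(deidentify(line, "tutor", [capitalized_proposer, lexicon_proposer], stub_reviewer))
```

The design point the sketch tries to capture is that stage 1 is allowed to be noisy (it proposes "Hi" and "Pythagorean" alongside "Maria"), because the reviewer only has to answer a binary question per candidate rather than find spans from scratch.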

Best for: AI Engineer, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.