Redact or Keep? A Fully Local AI Cascade for Educational Dialogue De-Identification
Summary
A fully local AI cascade framework de-identifies educational dialogue, addressing the challenge of distinguishing personally identifiable information (PII) from curricular content without data egress. The first stage is a high-recall candidate proposer that combines DeBERTa-v3-base and ModernBERT-base encoders with rule-based patterns to over-generate potential PII spans. The second stage is a contextual Redact/Keep reviewer, a LoRA-trained Gemma LLM that makes a binary decision for each candidate based on the surrounding dialogue and speaker role. Evaluated on math tutoring transcripts from two platforms, the strongest configuration, Union + Gemma 31B, achieved a macro F1 of 0.958, well above a same-family LLM-only baseline (0.767 F1) and a commercial API (0.706 F1). On a targeted challenge set of ambiguous names, the same configuration lost only 0.03 F1, indicating robustness. The framework is deployable on a single laptop-class machine, requiring approximately 4.3 hours of training per seed and 18 GB of inference memory.
Key takeaway
If you are an NLP engineer tasked with de-identifying sensitive educational dialogue, prioritize a cascaded approach over single-pass LLM extraction. Separating high-recall candidate proposal from contextual Redact/Keep review yields higher accuracy (0.958 macro F1) and holds up on ambiguous names. Your team can deploy the system fully locally, avoiding data egress risks, with a 31B Gemma reviewer requiring only 18 GB of inference memory on a laptop-class machine. Because the reviewer keeps curricular terms that merely resemble PII, this strategy preserves data utility while supporting privacy compliance.
Key insights
How the problem is formulated, with PII candidate proposal separated from contextual Redact/Keep review, matters more than model scale for local educational de-identification.
Principles
- Educational de-identification is contextual privacy triage.
- Prioritize recall in initial PII candidate generation.
- Local deployment is feasible for high-accuracy de-identification.
Method
A two-stage cascade: (1) Union of DeBERTa-v3-base, ModernBERT-base, and RegEx rules over-generates PII candidates. (2) A LoRA-trained Gemma LLM reviewer makes binary Redact/Keep decisions for each candidate using context.
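The two stages above can be sketched in a few lines of Python. This is a minimal illustration of the control flow only, not the authors' implementation: the encoder taggers and the LoRA-trained Gemma reviewer are replaced by toy stand-ins, and every name here (Span, regex_proposer, capitalized_proposer, toy_reviewer, CURRICULAR_TERMS, deidentify) is hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Span:
    start: int
    end: int
    text: str


Proposer = Callable[[str], Iterable[Span]]
# Reviewer takes (utterance, candidate span, speaker role); True means "redact".
Reviewer = Callable[[str, Span, str], bool]

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
CAPITALIZED_RE = re.compile(r"\b[A-Z][a-z]+\b")
CURRICULAR_TERMS = {"Pythagoras", "Euler"}  # toy lexicon, assumption


def regex_proposer(text: str) -> List[Span]:
    """Rule-based proposer: here only an email pattern."""
    return [Span(m.start(), m.end(), m.group()) for m in EMAIL_RE.finditer(text)]


def capitalized_proposer(text: str) -> List[Span]:
    """Stand-in for an encoder tagger: over-generates every capitalized word."""
    return [Span(m.start(), m.end(), m.group()) for m in CAPITALIZED_RE.finditer(text)]


def union_candidates(text: str, proposers: Iterable[Proposer]) -> List[Span]:
    """Stage 1: take the union of all proposers' spans to favor recall."""
    seen = set()
    for propose in proposers:
        seen.update(propose(text))
    return sorted(seen, key=lambda s: (s.start, s.end))


def toy_reviewer(text: str, span: Span, role: str) -> bool:
    """Stand-in for the LLM reviewer: keep known curricular terms, redact the rest."""
    return span.text not in CURRICULAR_TERMS


def merge_overlaps(spans: Iterable[Span]) -> List[Tuple[int, int]]:
    """Merge overlapping spans so each character is masked at most once."""
    merged: List[Tuple[int, int]] = []
    for s in sorted(spans, key=lambda s: (s.start, s.end)):
        if merged and s.start < merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], s.end))
        else:
            merged.append((s.start, s.end))
    return merged


def deidentify(text: str, role: str, proposers: Iterable[Proposer],
               reviewer: Reviewer, mask: str = "[REDACTED]") -> str:
    """Stage 2: ask the reviewer about each candidate, then mask the redacted ones."""
    candidates = union_candidates(text, proposers)
    to_redact = [s for s in candidates if reviewer(text, s, role)]
    # Replace right to left so earlier character offsets stay valid.
    for start, end in reversed(merge_overlaps(to_redact)):
        text = text[:start] + mask + text[end:]
    return text


utterance = "Maria, use Pythagoras here and mail maria@example.com"
result = deidentify(utterance, "tutor", [regex_proposer, capitalized_proposer], toy_reviewer)
print(result)
```

In this toy run the proposers surface three candidates (a name, a mathematician's name, and an email address), and the reviewer keeps only the curricular term. In a real system the reviewer call would be a prompt to the locally hosted LLM that includes the surrounding dialogue and the speaker role.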
In practice
- Implement a recall-first PII candidate proposer.
- Use a locally hosted, LoRA-tuned LLM for contextual Redact/Keep decisions.
- Consider human review for low-confidence PII cases.
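One way to act on the last recommendation is to route each reviewer decision by confidence. The sketch below assumes the reviewer exposes a probability that a span is PII; the source does not say how confidence is obtained, and the Decision class, the route function, and both thresholds are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    span_text: str
    p_redact: float  # assumed reviewer probability that the span is PII


def route(decision: Decision, low: float = 0.2, high: float = 0.8) -> str:
    """Auto-decide confident cases; send the uncertain middle band to a person."""
    if decision.p_redact >= high:
        return "redact"
    if decision.p_redact <= low:
        return "keep"
    return "human_review"


for d in [Decision("Maria", 0.95), Decision("Euler", 0.05), Decision("Angle", 0.50)]:
    print(d.span_text, route(d))
```

Widening the band between low and high sends more spans to reviewers and lowers the risk of a silent miss, so the thresholds would need tuning against the available review capacity.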
Topics
- Educational NLP
- Data De-identification
- Local LLM Deployment
- PII Detection
- Name Disambiguation
- Cascade AI Framework
Best for: AI Engineer, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.