The Day I Realized Our LLM Was Quietly Logging Patient SSNs
Summary
An editorial analyst recounts discovering that a small LLM summarizer, used for patient intake notes, was inadvertently including patient Social Security numbers (SSNs), names, and dates of birth in plaintext in logged prompts sent to a third-party model endpoint. This potential HIPAA breach occurred because user text was passed directly into prompts without redaction. The author implemented a data protection "gate" using Microsoft Presidio, an open-source tool, to swap sensitive values for reversible placeholders before they reach the model. Key design choices included resetting the token map between requests to prevent cross-patient data mixing, and making de-anonymization an explicit, opt-in, logged action rather than the default. The experience also exposed a subtle bug in callback functions: a callback with side effects can report incorrect token counts because the library may invoke it more than once. The lesson is to test properties of the actual output, such as the absence of identifiers and accurate counts.
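The callback pitfall mentioned in the summary can be reproduced without any particular library. The sketch below is illustrative only: `apply_operator` is a hypothetical stand-in for a library that probes a user-supplied callback once before using it, and `make_token` is a hypothetical callback that counts its own invocations. It is not a description of how any specific library behaves.

```python
from typing import Callable, Dict, List


def apply_operator(values: List[str], callback: Callable[[str], str]) -> List[str]:
    """Hypothetical library function that validates the callback before using it."""
    # Validation probe: one extra invocation the caller never asked for.
    if not isinstance(callback("PROBE"), str):
        raise TypeError("callback must return a string")
    return [callback(value) for value in values]


calls = 0
token_map: Dict[str, str] = {}  # placeholder -> original value


def make_token(value: str) -> str:
    """Callback with side effects: increments a counter and records a mapping."""
    global calls
    calls += 1  # counts every invocation, including the library's probe
    token = f"<PII_{calls}>"
    token_map[token] = value
    return token


tokens = apply_operator(["123-45-6789", "Jane Doe"], make_token)

naive_count = calls          # inflated by the probe call
actual_count = len(tokens)   # derived from the real output instead
```

Here the counter reports three replacements although only two values were replaced, and the placeholder numbering starts at 2. Deriving counts from the returned output, rather than from side effects inside the callback, avoids the discrepancy.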
Key takeaway
AI engineers building LLM features on sensitive data must implement a robust architectural gate to prevent PII exposure. Do not rely on prompt instructions; instead, ensure raw data never reaches the model by replacing it with reversible tokens. Always default to redacted output, and make de-anonymization an explicit, logged action. Most urgently, review your current LLM prompt logs for existing PII leaks.
Key insights
Data protection for LLMs is an architectural problem: it requires a robust redaction gate, not just prompt instructions.
Principles
- Never send raw sensitive data to an LLM.
- Isolate state to prevent cross-user data bleed.
- Make dangerous operations opt-in, not default.
Method
Implement a two-sided gate using a tool such as Microsoft Presidio: anonymize sensitive data into reversible tokens before model inference, and restore the original values only on the secure side and only on explicit request.
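The two-sided gate described above can be sketched in a few lines. This is a minimal, stdlib-only illustration, not the author's implementation: a simple SSN regex stands in for a real PII detector such as Microsoft Presidio, and the names `redact`, `restore`, and the `<SSN_n>` placeholder format are assumptions made for the example.

```python
import re
from typing import Dict, Tuple

# Assumption: a simple SSN regex stands in for a real detector such as Presidio.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text: str) -> Tuple[str, Dict[str, str]]:
    """Inbound side: swap each distinct SSN for a placeholder and return the map."""
    token_map: Dict[str, str] = {}  # placeholder -> original value
    seen: Dict[str, str] = {}       # original value -> placeholder

    def _swap(match):
        value = match.group(0)
        if value not in seen:
            token = f"<SSN_{len(seen) + 1}>"
            seen[value] = token
            token_map[token] = value
        return seen[value]

    return SSN_PATTERN.sub(_swap, text), token_map


def restore(text: str, token_map: Dict[str, str]) -> str:
    """Outbound side: put original values back, only where the map is available."""
    for token, value in token_map.items():
        text = text.replace(token, value)
    return text


note = "Patient SSN 123-45-6789 confirmed; spouse SSN 987-65-4321."
safe_note, mapping = redact(note)   # safe_note is what the model would see
round_trip = restore(safe_note, mapping)
```

Only `safe_note` would be placed in a prompt; `mapping` stays on the secure side, so the model endpoint never receives the raw identifiers.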
In practice
- Use Microsoft Presidio for PII detection.
- Clear token maps per request.
- Default to redacted output; require explicit opt-in.
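The last two practices can be sketched together. In this illustration, which is an assumption-laden sketch rather than the author's code, a hypothetical `RequestGate` object is created fresh for each request (one way to guarantee the token map starts empty), and its hypothetical `finalize` method returns redacted output unless the caller passes `deanonymize=True`, in which case the action is logged. A regex again stands in for Presidio's detection.

```python
import logging
import re
from typing import Dict

logger = logging.getLogger("pii_gate")

# Assumption: a simple SSN regex stands in for a real detector such as Presidio.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


class RequestGate:
    """Hypothetical per-request gate: one instance per request, so maps never mix."""

    def __init__(self, request_id: str) -> None:
        self.request_id = request_id
        self._token_map: Dict[str, str] = {}  # placeholder -> original value

    def redact(self, text: str) -> str:
        def _swap(match):
            value = match.group(0)
            # Reuse the placeholder if this value was already seen in this request.
            for token, original in self._token_map.items():
                if original == value:
                    return token
            token = f"<SSN_{len(self._token_map) + 1}>"
            self._token_map[token] = value
            return token

        return SSN_PATTERN.sub(_swap, text)

    def finalize(self, model_output: str, *, deanonymize: bool = False) -> str:
        """Return redacted output by default; restoring values is opt-in and logged."""
        if not deanonymize:
            return model_output
        logger.warning("de-anonymizing output for request %s", self.request_id)
        for token, value in self._token_map.items():
            model_output = model_output.replace(token, value)
        return model_output


gate_a = RequestGate("req-a")
prompt_a = gate_a.redact("SSN 123-45-6789")

gate_b = RequestGate("req-b")
prompt_b = gate_b.redact("SSN 987-65-4321")

default_output = gate_a.finalize(prompt_a)                    # stays redacted
restored_output = gate_b.finalize(prompt_b, deanonymize=True)  # explicit opt-in
```

Both requests produce the same placeholder text, yet each gate restores only its own patient's value, which is the isolation property the per-request reset is meant to provide.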
Topics
- LLM Security
- Data Privacy
- PII Redaction
- Microsoft Presidio
- HIPAA Compliance
- Healthcare AI
Best for: AI Engineer, MLOps Engineer, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by LLM on Medium.