I fine-tuned DeBERTa-v3 into a prompt injection pre-filter — 184M, runs on CPU, catches compound attacks by splitting input into fragments
Summary
SPID is a fine-tuned DeBERTa-v3-base model, comprising 184M parameters and approximately 1.5GB, designed as a prompt injection pre-filter that operates efficiently on a CPU, processing calls in about 300ms. Developed to reduce costs from injection attempts on LLM APIs, SPID's notable feature is its "fragment splitting" capability, which effectively identifies compound attacks where malicious prompts are hidden within seemingly innocuous text. It functions as a preliminary filter, blocking obvious injection attempts before they reach the LLM, rather than serving as a comprehensive firewall. In classifier mode, SPID achieves a precision of 0.94 and recall of 0.46, while its pipeline mode offers a precision of 0.79 and recall of 0.71. Current limitations include support only for English, no multi-turn conversation handling, and vulnerability to base64 or leetspeak encoding.
Key takeaway
For AI Engineers deploying LLMs to third parties or managing services with external access, integrating SPID as a local prompt injection pre-filter can significantly reduce operational costs and enhance security. By running on CPU and utilizing fragment splitting, SPID efficiently catches compound attacks before they reach your LLM, preventing unnecessary API calls for obvious or hidden injection attempts. Consider implementing this lightweight solution to add a crucial first layer of defense, especially when constraining LLMs to specific tasks.
Key insights
A fine-tuned DeBERTa-v3 model, SPID, pre-filters prompt injections locally using fragment splitting to catch compound attacks, reducing LLM API costs.
Principles
- Employ a cheap, local pre-filter.
- Fragment inputs to detect hidden attacks.
- Block obvious injections; pass borderline cases.
Method
Fine-tune DeBERTa-v3-base into a classifier. Implement fragment splitting to analyze input segments independently, identifying hidden prompt injection attempts within longer, seemingly safe prompts.
In practice
- Deploy SPID locally on CPU.
- Reduce LLM API costs from injections.
- Use fragment splitting for hidden attacks.
Topics
- Prompt Injection
- LLM Security
- DeBERTa-v3
- Fragment Splitting
- CPU Inference
- AI Agents
Code references
Best for: AI Architect, CTO, VP of Engineering/Data, AI Engineer, Machine Learning Engineer, AI Security Engineer
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.