I fine-tuned DeBERTa-v3 into a prompt injection pre-filter — 184M, runs on CPU, catches compound attacks by splitting input into fragments

2026-06-01 · Source: Machine Learning ML & Generative AI News · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Advanced, quick

Summary

SPID is a fine-tuned DeBERTa-v3-base model, comprising 184M parameters and approximately 1.5GB, designed as a prompt injection pre-filter that operates efficiently on a CPU, processing calls in about 300ms. Developed to reduce costs from injection attempts on LLM APIs, SPID's notable feature is its "fragment splitting" capability, which effectively identifies compound attacks where malicious prompts are hidden within seemingly innocuous text. It functions as a preliminary filter, blocking obvious injection attempts before they reach the LLM, rather than serving as a comprehensive firewall. In classifier mode, SPID achieves a precision of 0.94 and recall of 0.46, while its pipeline mode offers a precision of 0.79 and recall of 0.71. Current limitations include support only for English, no multi-turn conversation handling, and vulnerability to base64 or leetspeak encoding.

Key takeaway

For AI Engineers deploying LLMs to third parties or managing services with external access, integrating SPID as a local prompt injection pre-filter can significantly reduce operational costs and enhance security. By running on CPU and utilizing fragment splitting, SPID efficiently catches compound attacks before they reach your LLM, preventing unnecessary API calls for obvious or hidden injection attempts. Consider implementing this lightweight solution to add a crucial first layer of defense, especially when constraining LLMs to specific tasks.

Key insights

A fine-tuned DeBERTa-v3 model, SPID, pre-filters prompt injections locally using fragment splitting to catch compound attacks, reducing LLM API costs.

Principles

Employ a cheap, local pre-filter.
Fragment inputs to detect hidden attacks.
Block obvious injections; pass borderline cases.

Method

Fine-tune DeBERTa-v3-base into a classifier. Implement fragment splitting to analyze input segments independently, identifying hidden prompt injection attempts within longer, seemingly safe prompts.

In practice

Deploy SPID locally on CPU.
Reduce LLM API costs from injections.
Use fragment splitting for hidden attacks.

Topics

Prompt Injection
LLM Security
DeBERTa-v3
Fragment Splitting
CPU Inference
AI Agents

Code references

JHC56/spid

Best for: AI Architect, CTO, VP of Engineering/Data, AI Engineer, Machine Learning Engineer, AI Security Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.