FENCE: A Financial and Multimodal Jailbreak Detection Dataset
Summary
FENCE is a new bilingual (Korean–English) multimodal dataset designed to train and evaluate jailbreak detectors for Vision Language Models (VLMs) in financial applications. Developed by KakaoBank, FENCE addresses the scarcity of resources for VLM jailbreak detection, particularly in sensitive financial domains. The dataset emphasizes domain realism, covering over 15 financial topics with image-grounded threats and a balanced 50:50 ratio of benign and harmful samples. In experiments on FENCE, the commercial models GPT-4o and GPT-4o-mini had attack success rates of 3.2% and 9.2% respectively, while open-source models were more vulnerable. A baseline detector trained on FENCE achieved 99% in-distribution accuracy and maintained strong performance on external benchmarks, which suggests the dataset is a sound basis for training reliable detection models.
Key takeaway
AI security engineers deploying Vision Language Models in financial services should prioritize domain-specific jailbreak detection. FENCE provides a resource for identifying vulnerabilities and training robust guardrail models, and it is relevant even for highly aligned commercial VLMs like GPT-4o, which still showed a nonzero attack success rate on it. Consider fine-tuning smaller, specialized models on datasets like FENCE to achieve high defense success rates and to support compliance in sensitive financial applications, strengthening defenses against multimodal attacks.
Key insights
Multimodal jailbreaking poses significant risks to VLMs, especially in finance, necessitating specialized detection datasets.
Principles
- Domain-specific datasets reveal hidden VLM vulnerabilities.
- Balanced benign/harmful samples improve detector training.
- Robust multimodal understanding is key for detection.
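As a small illustration of the balance principle, the sketch below downsamples a labeled sample list so every class has the same count. It is a generic technique, not the authors' procedure; the `label` key and the `balance_labels` helper are assumptions made for this example.

```python
import random
from collections import Counter


def balance_labels(samples, label_key="label", seed=0):
    """Downsample every class to the size of the smallest class.

    `samples` is a list of dicts; `label_key` is a hypothetical field name.
    """
    rng = random.Random(seed)
    by_label = {}
    for sample in samples:
        by_label.setdefault(sample[label_key], []).append(sample)

    # The smallest class determines how many items each class keeps.
    n = min(len(group) for group in by_label.values())

    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n))
    rng.shuffle(balanced)
    return balanced


# Toy data: 6 harmful and 4 benign records become 4 and 4.
data = [{"label": "harmful"}] * 6 + [{"label": "benign"}] * 4
print(Counter(s["label"] for s in balance_labels(data)))
```

Downsampling discards data; when the minority class is small, reweighting the loss is a common alternative.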
Method
FENCE was constructed via a three-step pipeline: (1) transforming benign financial queries into harmful ones using GPT-4o, (2) collecting query-relevant, copyright-free images, and (3) fusing text and images, including with FigStep templates.
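The data flow of the three steps can be sketched roughly as follows. This is only a structural outline under stated assumptions: the `Sample` fields, the `build_pair` helper, the two injected callables, and the choice to pair both queries with one image are hypothetical, and the GPT-4o rewriting and image collection steps are left as stubs supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    text: str
    image_path: str
    label: str     # "benign" or "harmful"
    template: str  # e.g. "plain" or "figstep" (hypothetical names)


def build_pair(
    benign_query: str,
    to_harmful: Callable[[str], str],  # step 1 stub (GPT-4o in the source)
    find_image: Callable[[str], str],  # step 2 stub (image collection)
    template: str = "plain",
) -> List[Sample]:
    """Return a benign and a harmful sample built from one seed query."""
    harmful_query = to_harmful(benign_query)  # step 1: rewrite the query
    image_path = find_image(benign_query)     # step 2: query-relevant image
    # Step 3: fuse text and image. Here fusion only records the pairing and
    # the template name; real fusion would render or compose the image.
    return [
        Sample(benign_query, image_path, "benign", template),
        Sample(harmful_query, image_path, "harmful", template),
    ]
```

Passing the rewriting and image lookup in as callables keeps the outline independent of any particular model API or image source.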
In practice
- Use FENCE to benchmark VLM safety in financial contexts.
- Fine-tune lightweight models like Qwen2.5-VL 3B with FENCE.
- Prioritize models with strong Image-Text Recognition (ITR) for detection.
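When benchmarking, the two headline numbers are attack success rate (the share of harmful prompts a VLM complies with) and detector accuracy. A minimal sketch of both follows, assuming each attack has already been judged as a success or failure; how that judgment is made (human review, a judge model) is outside this example, and the function names are illustrative.

```python
def attack_success_rate(outcomes):
    """Fraction of harmful prompts for which the attack succeeded.

    `outcomes` is a list of booleans: True if the model complied.
    """
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    return sum(outcomes) / len(outcomes)


def detection_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and equal in length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy values only, not results from the paper.
print(attack_success_rate([True, False, False, False]))  # 0.25
print(detection_accuracy(
    ["harmful", "benign", "harmful", "benign"],
    ["harmful", "benign", "benign", "benign"],
))  # 0.75
```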
Topics
- Multimodal Jailbreaking
- Vision Language Models
- Financial AI Safety
- Dataset Development
- Jailbreak Detection
- GPT-4o Vulnerabilities
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.