FENCE: A Financial and Multimodal Jailbreak Detection Dataset

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Banking & Financial Services · Depth: Advanced, extended

Summary

FENCE is a bilingual (Korean–English) multimodal dataset for training and evaluating jailbreak detectors for Vision Language Models (VLMs) in financial applications. Developed by Kakaobank, it addresses the scarcity of resources for VLM jailbreak detection, particularly in sensitive financial domains. The dataset emphasizes domain realism, covering more than 15 financial topics with image-grounded threats and a balanced 50:50 ratio of benign to harmful samples. In the reported experiments, the commercial models GPT-4o and GPT-4o-mini had attack success rates of 3.2% and 9.2% respectively on FENCE, while open-source models were more vulnerable. A baseline detector trained on FENCE reached 99% in-distribution accuracy and maintained strong performance on external benchmarks, which suggests the dataset is a sound basis for building reliable detection models.

Key takeaway

AI security engineers deploying Vision Language Models in financial services should prioritize domain-specific jailbreak detection. FENCE is a resource for exposing vulnerabilities and training guardrail models, and even well-aligned commercial VLMs such as GPT-4o showed nonzero attack success rates on it. Consider fine-tuning smaller, specialized models on datasets like FENCE to raise defense success rates against multimodal attacks in sensitive financial applications.

Key insights

Multimodal jailbreaking poses significant risks to VLMs, especially in finance, necessitating specialized detection datasets.

Principles

Domain-specific datasets reveal hidden VLM vulnerabilities.
Balanced benign/harmful samples improve detector training.
Robust multimodal understanding is key for detection.
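
As a rough illustration of the balanced benign/harmful principle, the sketch below downsamples the majority class of a toy labeled set to a 50:50 ratio. The function name `balance_binary`, the `label` field, and the toy records are assumptions made for this example; they are not FENCE's actual schema or construction code.

```python
import random
from collections import Counter

def balance_binary(samples, label_key="label", seed=0):
    """Downsample the majority class so both labels appear equally often."""
    by_label = {}
    for sample in samples:
        by_label.setdefault(sample[label_key], []).append(sample)
    if len(by_label) != 2:
        raise ValueError("expected exactly two labels")
    # Size of the smaller class; every class is cut down to this size.
    n = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n))
    rng.shuffle(balanced)
    return balanced

# Toy records (hypothetical): 6 benign and 4 harmful before balancing.
toy = [{"id": i, "label": "benign"} for i in range(6)]
toy += [{"id": i, "label": "harmful"} for i in range(6, 10)]

balanced = balance_binary(toy)
counts = Counter(sample["label"] for sample in balanced)
print(counts["benign"], counts["harmful"])
```

Downsampling is only one way to reach a balanced split; a dataset built in matched benign/harmful pairs would be balanced by construction.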

Method

FENCE was constructed via a three-step pipeline: (1) transforming benign financial queries into harmful ones using GPT-4o, (2) collecting query-relevant, copyright-free images, and (3) fusing text and images, with FigStep templates among the fusion formats.
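
The shape of that pipeline can be sketched as a small data flow. This is a minimal sketch under stated assumptions, not the authors' code: the `Sample` record, `build_pair`, and both placeholder callables are hypothetical, the rewrite step is a stub rather than a GPT-4o call, sharing one image between the benign and harmful query is an assumption, and FigStep templates are not modeled.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass(frozen=True)
class Sample:
    text: str   # text prompt paired with the image
    image: str  # path or identifier of the collected image
    label: str  # "benign" or "harmful"

def build_pair(
    benign_query: str,
    rewrite: Callable[[str], str],
    find_image: Callable[[str], str],
) -> List[Sample]:
    """Run the three steps for one seed query and return both samples."""
    harmful_query = rewrite(benign_query)   # step 1: benign -> harmful
    image = find_image(benign_query)        # step 2: query-relevant image
    return [                                # step 3: fuse text and image
        Sample(benign_query, image, "benign"),
        Sample(harmful_query, image, "harmful"),
    ]

# Placeholder steps (hypothetical); a real pipeline would call a model
# and an image source here.
def placeholder_rewrite(query: str) -> str:
    return f"<harmful rewrite of: {query}>"

def placeholder_image(query: str) -> str:
    return "images/placeholder.png"

pair = build_pair(
    "How do I open a savings account?",
    placeholder_rewrite,
    placeholder_image,
)
print([sample.label for sample in pair])
```

Passing the rewrite and image steps in as callables keeps the fusion logic testable without any model or network access.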

In practice

Use FENCE to benchmark VLM safety in financial contexts.
Fine-tune lightweight models like Qwen2.5-VL 3B with FENCE.
Prioritize models with strong Image-Text Recognition (ITR) for detection.
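
For benchmarking, two simple metrics cover the figures quoted in the summary: attack success rate (ASR) for a target VLM and accuracy for a detector. The sketch below assumes each harmful prompt has already been judged as a success or failure, and that detector outputs are binary labels; the function names and toy counts are illustrative and do not reproduce the paper's evaluation protocol.

```python
from typing import Sequence

def attack_success_rate(successes: Sequence[bool]) -> float:
    """Fraction of harmful prompts for which the jailbreak was judged successful."""
    if not successes:
        raise ValueError("no outcomes to score")
    return sum(successes) / len(successes)

def detector_accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of samples where the detector's label matches the gold label."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists differ in length")
    if not y_true:
        raise ValueError("no samples to score")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy outcomes (hypothetical counts): 16 successes out of 500 attempts.
outcomes = [True] * 16 + [False] * 484
asr = attack_success_rate(outcomes)

# Toy detector predictions on a balanced four-sample set.
gold = ["benign", "harmful", "benign", "harmful"]
pred = ["benign", "harmful", "harmful", "harmful"]
acc = detector_accuracy(gold, pred)

print(f"ASR = {asr:.1%}, accuracy = {acc:.0%}")
```

On a balanced 50:50 set, plain accuracy is a reasonable headline number; on skewed data, per-class recall would be more informative.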

Topics

Multimodal Jailbreaking
Vision Language Models
Financial AI Safety
Dataset Development
Jailbreak Detection
GPT-4o Vulnerabilities

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.