DeEscalWild: A Real-World Benchmark for Automated De-Escalation Training with SLMs

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

DeEscalWild is a novel, large-scale benchmark dataset designed for training Small Language Models (SLMs) in automated de-escalation for law enforcement. Curated from 5,000 raw "in-the-wild" police-civilian interaction videos from open-source platforms like YouTube and TikTok, the dataset was rigorously filtered to 1,500 high-fidelity scenarios using a hybrid human-in-the-loop and "LLM-as-a-Judge" pipeline. This corpus contains 285,887 dialogue turns, totaling approximately 4.7 million tokens. Experiments demonstrate that SLMs, specifically a fine-tuned Qwen 2.5 (3B-Instruct), significantly outperform their base counterparts and even the larger, general-purpose Gemini 2.5 Flash model across ROUGE-L, BLEU-4, METEOR, and BERTScore metrics. This performance is achieved with a fraction of the computational cost, enabling low-latency, privacy-preserving officer training systems on edge devices like VR headsets.
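To make one of the reported metrics concrete, here is a minimal sketch of ROUGE-L, which scores a generated reply by the longest common subsequence (LCS) it shares with a reference reply. It assumes lowercase whitespace tokenization and the plain F1 variant; the function names and the example sentences are illustrative and are not taken from the paper's evaluation code.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 between two strings (assumption: whitespace tokens)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)   # share of candidate tokens in the LCS
    recall = lcs / len(ref)       # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference and model reply, for illustration only.
reference = "please step back and keep your hands visible"
candidate = "step back and show your hands"
print(round(rouge_l_f1(reference, candidate), 3))
```

Published evaluations typically use an established implementation with stemming and a fixed tokenizer, so absolute scores from this sketch are not comparable with the paper's numbers.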

Key takeaway

For NLP engineers building immersive training simulations for law enforcement, fine-tuning a Small Language Model (SLM) such as Qwen 2.5 (3B) on the DeEscalWild dataset is both more accurate on the reported metrics and cheaper to run than relying on a larger, general-purpose LLM. The smaller model can respond with sub-second latency and run on edge devices, which keeps training data on the device and supports real-time field training.

Key insights

Domain-specific fine-tuning of SLMs on real-world data surpasses larger generalist LLMs for high-stakes de-escalation training.

Principles

Method

A multi-stage pipeline extracts, transcribes, and filters "in-the-wild" police-civilian videos using LLM-based feature extraction and human verification to create a high-fidelity de-escalation dialogue dataset.
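The filtering stage described above can be sketched as a two-gate check: an automated judge scores each transcript, and a human reviewer confirms the ones that pass. Everything here is a hypothetical illustration: the `Candidate` type, the `filter_candidates` function, the 0.5 threshold, and the keyword-based `toy_judge` (a stand-in for a real LLM call) are assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Candidate:
    video_id: str     # hypothetical identifier for the source video
    transcript: str   # text produced by the transcription stage


def filter_candidates(
    candidates: Iterable[Candidate],
    judge: Callable[[str], float],
    human_approves: Callable[[Candidate], bool],
    threshold: float = 0.5,
) -> Tuple[List[Candidate], List[Candidate]]:
    """Split candidates into (kept, rejected).

    A candidate is kept only if its transcript is non-empty, the judge
    score reaches the threshold, and the human reviewer approves it.
    """
    kept: List[Candidate] = []
    rejected: List[Candidate] = []
    for cand in candidates:
        if not cand.transcript.strip():
            rejected.append(cand)  # nothing usable was transcribed
            continue
        if judge(cand.transcript) >= threshold and human_approves(cand):
            kept.append(cand)
        else:
            rejected.append(cand)
    return kept, rejected


def toy_judge(transcript: str) -> float:
    """Stand-in for an LLM judge: fraction of de-escalation cues present."""
    cues = ("calm", "understand", "step back")
    text = transcript.lower()
    return sum(cue in text for cue in cues) / len(cues)


# Hypothetical inputs, for illustration only.
samples = [
    Candidate("vid-1", "Stay calm, I understand why you are upset."),
    Candidate("vid-2", ""),
    Candidate("vid-3", "Move along."),
]
kept, rejected = filter_candidates(samples, toy_judge, lambda c: True)
print([c.video_id for c in kept], [c.video_id for c in rejected])
```

Running the cheap automated gate before the human one means reviewers only see transcripts the judge already considers plausible, which is the usual motivation for a hybrid design of this kind.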

Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.