DeEscalWild: A Real-World Benchmark for Automated De-Escalation Training with SLMs

2026-01-24 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

DeEscalWild is a large-scale benchmark dataset for training Small Language Models (SLMs) in automated de-escalation for law enforcement. The authors collected 5,000 raw "in-the-wild" police-civilian interaction videos from publicly accessible platforms such as YouTube and TikTok, then filtered them down to 1,500 high-fidelity scenarios with a hybrid human-in-the-loop and "LLM-as-a-Judge" pipeline. The resulting corpus contains 285,887 dialogue turns, or approximately 4.7 million tokens. In the reported experiments, fine-tuned SLMs, in particular Qwen 2.5 (3B-Instruct), significantly outperform their base counterparts and even the larger, general-purpose Gemini 2.5 Flash on ROUGE-L, BLEU-4, METEOR, and BERTScore. They do so at a fraction of the computational cost, which makes low-latency, privacy-preserving officer training systems feasible on edge devices such as VR headsets.
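
The corpus figures above imply a few derived quantities that help when sizing a fine-tuning run. The following is a minimal sketch that only does arithmetic on the numbers quoted in the summary; the constant and function names are illustrative, and the token count is the rounded "approximately 4.7 million" figure, so the per-turn and per-scenario averages are approximate.

```python
# Figures quoted in the summary; APPROX_TOKENS is a rounded value.
RAW_VIDEOS = 5_000
KEPT_SCENARIOS = 1_500
DIALOGUE_TURNS = 285_887
APPROX_TOKENS = 4_700_000


def corpus_stats():
    """Derive simple averages from the reported corpus totals."""
    return {
        # Share of raw videos that survived filtering.
        "retention_rate": KEPT_SCENARIOS / RAW_VIDEOS,
        # Average number of dialogue turns in one scenario.
        "turns_per_scenario": DIALOGUE_TURNS / KEPT_SCENARIOS,
        # Average turn length in tokens (approximate).
        "tokens_per_turn": APPROX_TOKENS / DIALOGUE_TURNS,
        # Average scenario length in tokens (approximate).
        "tokens_per_scenario": APPROX_TOKENS / KEPT_SCENARIOS,
    }


if __name__ == "__main__":
    for name, value in corpus_stats().items():
        print(f"{name}: {value:.2f}")
```

On these numbers, about 30% of the raw videos are retained, a scenario averages roughly 190 turns, and a turn averages roughly 16 tokens, which suggests short conversational utterances rather than long monologues.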

Key takeaway

For NLP engineers building immersive training simulations for law enforcement, fine-tuning a Small Language Model (SLM) such as Qwen 2.5 (3B) on the DeEscalWild dataset is a more accurate and more efficient approach than relying on a larger, general-purpose LLM. Teams can get better performance in critical de-escalation scenarios at sub-second latency, and can deploy on edge devices while preserving privacy, which is vital for real-time field training.

Key insights

Domain-specific fine-tuning of SLMs on real-world data surpasses larger generalist LLMs for high-stakes de-escalation training.
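
The comparison behind this insight rests on reference-based metrics such as ROUGE-L, which scores a generated reply by its longest common subsequence (LCS) with a reference reply. The following is a simplified sketch of ROUGE-L F1 using lowercase whitespace tokenization; the paper's exact tokenization, stemming, and evaluation library are not stated here, so this is an assumption-laden illustration and not the authors' evaluation code. The function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 with whitespace tokenization (a simplifying assumption)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference = "please step back and keep your hands visible"
    candidate = "please step back and show me your hands"
    print(f"ROUGE-L F1: {rouge_l_f1(reference, candidate):.3f}")
```

Because ROUGE-L only rewards in-order word overlap, it is usually read alongside a semantic metric such as BERTScore, which is one reason the benchmark reports several metrics together.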

Principles

Data quality can substitute for parameter scale.
Ecological validity is crucial for high-stakes simulations.
Privacy-preserving edge deployment is achievable with SLMs.

Method

A multi-stage pipeline extracts, transcribes, and filters "in-the-wild" police-civilian videos using LLM-based feature extraction and human verification to create a high-fidelity de-escalation dialogue dataset.
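
The filtering stage of such a pipeline can be sketched as a two-gate filter: an automated judge scores each transcript, and a human reviewer confirms the survivors. This is a minimal illustration under stated assumptions, not the authors' implementation. The `Scenario` class, the `judge` and `human_ok` callables, and the 0.7 threshold are all hypothetical; in a real system `judge` would wrap an LLM call and `human_ok` would wrap an annotation tool.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Scenario:
    video_id: str
    transcript: str
    judge_score: float = 0.0  # hypothetical quality score in [0, 1]


def filter_scenarios(
    candidates: Iterable[Scenario],
    judge: Callable[[str], float],
    human_ok: Callable[[Scenario], bool],
    threshold: float = 0.7,  # hypothetical cut-off, not from the paper
) -> List[Scenario]:
    """Keep scenarios that pass the automated judge and human review."""
    kept = []
    for scenario in candidates:
        # Drop clips whose transcription produced no usable text.
        if not scenario.transcript.strip():
            continue
        # Gate 1: automated "LLM-as-a-Judge" style score.
        scenario.judge_score = judge(scenario.transcript)
        if scenario.judge_score < threshold:
            continue
        # Gate 2: human verification of the remaining candidates.
        if human_ok(scenario):
            kept.append(scenario)
    return kept


if __name__ == "__main__":
    # Toy stand-ins for the judge and the reviewer.
    def toy_judge(text: str) -> float:
        return 0.9 if "Officer:" in text else 0.1

    clips = [
        Scenario("a", "Officer: Please step back. Civilian: Okay."),
        Scenario("b", "   "),
        Scenario("c", "background music only"),
    ]
    kept = filter_scenarios(clips, toy_judge, lambda s: True)
    print([s.video_id for s in kept])
```

Running the cheap automated gate first means human reviewers only see candidates that already cleared the judge, which is the usual motivation for a hybrid design like this.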

In practice

Fine-tune SLMs with QLoRA for efficiency.
Use Gemini 2.5 Flash for multimodal diarization.
Prioritize real-world data over synthetic data for realism.
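
To see why QLoRA helps with efficiency, recall that it keeps the base model frozen in 4-bit precision and trains only small low-rank adapter matrices. The following back-of-the-envelope sketch estimates weight memory and adapter size. It assumes a nominal 3 billion parameters for a "3B" model, and the 2048 x 2048 projection and rank 16 are illustrative values, not settings taken from the paper. It also ignores activations, optimizer state, and quantization constants, so real memory use will be higher.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in one LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    n_params = 3e9  # nominal size of a "3B" model (assumption)
    print(f"16-bit weights: {weight_memory_gb(n_params, 16):.1f} GB")
    print(f" 4-bit weights: {weight_memory_gb(n_params, 4):.1f} GB")

    d, rank = 2048, 16  # illustrative layer width and adapter rank
    full = d * d
    adapter = lora_params(d, d, rank)
    print(f"Adapter trains {adapter:,} of {full:,} weights "
          f"({100 * adapter / full:.2f}%) for one {d}x{d} projection")
```

Under these assumptions the frozen weights shrink from about 6 GB to about 1.5 GB, and each adapted projection trains under 2% of the weights it modifies, which is what makes fine-tuning and edge deployment of a model this size plausible.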

Topics

De-escalation Training
Small Language Models
DeEscalWild Dataset
Police-Civilian Interactions
Edge AI Deployment

Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.