Beyond English and Evasion: A Human-Annotated Multi-Domain Benchmark for High-Stakes LLM Safety Evaluation in Chinese

· Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Data Science & Analytics · Depth: Advanced, quick

Summary

ChiSafe-PAS is a human-annotated benchmark for evaluating Large Language Model (LLM) safety in Chinese-language settings, motivated by the failure of English-centric safety systems to transfer across linguistic and cultural boundaries. It comprises 1,897 adversarial Chinese prompts, 1,544 of them fully annotated, covering high-stakes domains such as self-harm and violence, drug and illicit trade, fraud, and satire. Annotations include a three-class response label (REFUSE, SAFE-REDIRECT, RESPOND), a nine-category obfuscation taxonomy, risk-level ratings, and annotator rationales. The resource aims to give the research community a high-quality, culturally grounded tool for benchmarking LLM safety alignment, and it highlights tensions around data blurring, real-world risk coverage, and cultural expertise.
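One way to picture a single annotated record is the minimal Python sketch below. Only the three response labels come from the summary. The field names, the string-valued obfuscation category, the integer risk scale, and the placeholder records are all assumptions made for illustration, and the released schema may differ.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class ResponseLabel(Enum):
    # The three response classes named in the summary.
    REFUSE = "REFUSE"
    SAFE_REDIRECT = "SAFE-REDIRECT"
    RESPOND = "RESPOND"


@dataclass(frozen=True)
class AnnotatedPrompt:
    # Hypothetical field names and types; not the benchmark's actual schema.
    prompt: str
    label: ResponseLabel
    obfuscation: str  # one of the nine taxonomy categories (not enumerated here)
    risk_level: int   # the rating scale is an assumption
    rationale: str


def label_distribution(records):
    """Count how many records carry each response label."""
    return Counter(r.label for r in records)


# Placeholder records, not real benchmark data.
records = [
    AnnotatedPrompt("<prompt 1>", ResponseLabel.REFUSE, "pinyin", 3, "<rationale>"),
    AnnotatedPrompt("<prompt 2>", ResponseLabel.SAFE_REDIRECT, "slang", 2, "<rationale>"),
    AnnotatedPrompt("<prompt 3>", ResponseLabel.REFUSE, "slang", 3, "<rationale>"),
]
dist = label_distribution(records)
```

Keeping the label as an enum rather than a free string makes it harder for a typo in a label name to go unnoticed when loading annotations.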

Key takeaway

For NLP engineers deploying LLMs in Chinese-language settings, relying solely on English-centric safety systems is insufficient. Integrate culturally grounded benchmarks such as ChiSafe-PAS to identify and mitigate Chinese-specific evasion techniques, such as Pinyin romanization or internet slang, so that safety alignment holds up in high-stakes domains.
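As a rough illustration of how such a benchmark might be used, the sketch below compares a model's observed response class with the human label and reports agreement per obfuscation category. The function name, the tuple layout, the category strings "pinyin" and "slang", and the sample rows are hypothetical; this is not the paper's evaluation protocol.

```python
from collections import defaultdict


def agreement_by_category(examples):
    """Return {category: fraction of examples where predicted == gold}.

    Each example is a (category, gold_label, predicted_label) tuple.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, gold, predicted in examples:
        totals[category] += 1
        if gold == predicted:
            hits[category] += 1
    return {c: hits[c] / totals[c] for c in totals}


# Placeholder rows, not real benchmark results.
examples = [
    ("pinyin", "REFUSE", "RESPOND"),
    ("pinyin", "REFUSE", "REFUSE"),
    ("slang", "SAFE-REDIRECT", "SAFE-REDIRECT"),
    ("slang", "REFUSE", "REFUSE"),
]
scores = agreement_by_category(examples)
```

Breaking the score down by category, rather than reporting one aggregate number, is what would surface a weakness specific to one evasion technique.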

Key insights

Chinese LLM safety requires culturally specific benchmarks to counter linguistic and cultural evasion techniques.

Method

The paper describes the dataset design, the annotation process, and the nine-category obfuscation taxonomy of ChiSafe-PAS in detail.

Best for: Research Scientist, AI Engineer, Machine Learning Engineer, AI Scientist, NLP Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.