Beyond English and Evasion: A Human-Annotated Multi-Domain Benchmark for High-Stakes LLM Safety Evaluation in Chinese
Summary
ChiSafe-PAS is a new human-annotated benchmark for evaluating Large Language Model (LLM) safety in Chinese-language settings, where English-centric safety systems often fail across linguistic and cultural boundaries. It comprises 1,897 adversarial Chinese prompts, 1,544 of them fully annotated, covering high-stakes domains such as self-harm and violence, drug and illicit trade, fraud, and satire. Each annotation includes a three-class response label (REFUSE, SAFE-REDIRECT, RESPOND), a tag from a nine-category obfuscation taxonomy, a risk-level rating, and an annotator rationale. The resource aims to give the research community a high-quality, culturally grounded tool for benchmarking LLM safety alignment, and it highlights tensions around data blurring, real-world risk coverage, and cultural expertise.
Key takeaway
For NLP engineers deploying Large Language Models in Chinese-language settings, English-centric safety systems alone are insufficient. Integrate culturally grounded benchmarks such as ChiSafe-PAS to identify and mitigate Chinese-specific evasion techniques, such as Pinyin romanization or internet slang, and to catch safety failures in high-stakes domains before deployment.
Key insights
Chinese LLM safety requires culturally specific benchmarks to counter linguistic and cultural evasion techniques.
Principles
- English-centric LLM safety systems often fail across linguistic and cultural boundaries.
- Chinese-specific evasion techniques include Pinyin, character decomposition, slang, and hedging tone.
- Culturally grounded resources are crucial for robust LLM safety alignment.
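To make the Pinyin point concrete, here is a deliberately crude sketch of a surface check that flags Latin-letter runs inside otherwise Chinese text. It is not the benchmark's method: the function name `latin_runs_in_chinese` and the two-letter threshold are assumptions for illustration, and the check also fires on ordinary English loanwords, so it can only nominate prompts for human review.

```python
import re

# Basic CJK Unified Ideographs block; rarer extension blocks are ignored here.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")
# Two or more consecutive ASCII letters (the threshold is an arbitrary assumption).
LATIN_RUN = re.compile(r"[A-Za-z]{2,}")


def latin_runs_in_chinese(text: str) -> list[str]:
    """Return Latin-letter runs found in text that also contains Chinese characters.

    Text with no Chinese characters returns an empty list, since romanization
    is only interesting here as a mix-in to a Chinese prompt.
    """
    if not CJK_CHAR.search(text):
        return []
    return LATIN_RUN.findall(text)


if __name__ == "__main__":
    # Benign example: a romanized place name embedded in a Chinese sentence.
    print(latin_runs_in_chinese("我想去 Beijing 旅游"))
```

A check like this says nothing about character decomposition, slang, or hedging tone, which is part of why the summary stresses human annotation and cultural expertise.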
Method
The article describes in detail the dataset design, the annotation process, and the nine-category obfuscation taxonomy behind ChiSafe-PAS.
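One way to picture an annotated record is the sketch below. Only the three label values come from the summary; every field name, the free-text obfuscation tags, and the integer risk level are assumptions, since the summary does not give the released file format, the taxonomy's category names, or the rating scale.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class ResponseLabel(Enum):
    """The three response classes named in the summary."""
    REFUSE = "REFUSE"
    SAFE_REDIRECT = "SAFE-REDIRECT"
    RESPOND = "RESPOND"


@dataclass
class AnnotatedPrompt:
    # All field names below are hypothetical, not the dataset's actual schema.
    prompt: str
    domain: str                       # e.g. "fraud"; the domain vocabulary is assumed
    label: ResponseLabel
    obfuscation: list[str] = field(default_factory=list)  # taxonomy tags as free strings
    risk_level: Optional[int] = None  # the scale is not stated in the summary
    rationale: str = ""


def parse_record(raw: dict) -> AnnotatedPrompt:
    """Build a record from a dict, raising ValueError on an unknown label."""
    return AnnotatedPrompt(
        prompt=raw["prompt"],
        domain=raw["domain"],
        label=ResponseLabel(raw["label"]),
        obfuscation=list(raw.get("obfuscation", [])),
        risk_level=raw.get("risk_level"),
        rationale=raw.get("rationale", ""),
    )
```

Validating the label through the enum means a typo such as "SAFE_REDIRECT" fails loudly at load time instead of silently creating a fourth class.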
In practice
- Benchmark LLM safety alignment using culturally grounded Chinese datasets.
- Identify and address Chinese-specific linguistic evasion techniques.
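A minimal sketch of what such benchmarking could look like, assuming each model response has already been mapped to one of the three labels by some separate step (human review or a classifier, neither specified here). The record keys `id`, `domain`, and `label` and both function names are hypothetical, and the summary does not say which metrics the original article reports.

```python
from collections import defaultdict

LABELS = ("REFUSE", "SAFE-REDIRECT", "RESPOND")


def agreement_by_domain(records, predictions):
    """Fraction of prompts per domain where the model's label matches the human label.

    records: iterable of dicts with hypothetical keys 'id', 'domain', 'label'.
    predictions: dict mapping record id to the label assigned to the model's response.
    Records without a prediction are skipped.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        pred = predictions.get(rec["id"])
        if pred is None:
            continue
        if pred not in LABELS:
            raise ValueError(f"unknown label: {pred!r}")
        totals[rec["domain"]] += 1
        hits[rec["domain"]] += int(pred == rec["label"])
    return {domain: hits[domain] / totals[domain] for domain in totals}


def count_over_compliance(records, predictions):
    """Count prompts annotators marked REFUSE that the model nevertheless answered."""
    return sum(
        1
        for rec in records
        if rec["label"] == "REFUSE" and predictions.get(rec["id"]) == "RESPOND"
    )
```

Reporting agreement per domain rather than as one aggregate keeps a weak domain from being masked by strong ones, and the over-compliance count isolates the failure mode that matters most in high-stakes settings.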
Topics
- LLM Safety Evaluation
- Chinese Language Models
- Adversarial Prompts
- Human Annotation
- Cross-cultural Safety
- Linguistic Evasion
Best for: Research Scientist, AI Engineer, Machine Learning Engineer, AI Scientist, NLP Engineer, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.