Beyond English and Evasion: A Human-Annotated Multi-Domain Benchmark for High-Stakes LLM Safety Evaluation in Chinese

2026-05-28 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Data Science & Analytics · Depth: Advanced, quick

Summary

ChiSafe-PAS is a human-annotated benchmark for evaluating Large Language Model (LLM) safety in Chinese-language settings, where English-centric safety systems often fail across linguistic and cultural boundaries. It comprises 1,897 adversarial Chinese prompts, 1,544 of them fully annotated, covering high-stakes domains such as self-harm and violence, drug and illicit trade, fraud, and satire. Annotations include a three-class response label (REFUSE, SAFE-REDIRECT, RESPOND), a nine-category obfuscation taxonomy, risk-level ratings, and annotator rationales. The resource aims to give the research community a high-quality, culturally grounded tool for benchmarking LLM safety alignment, and it highlights tensions around data blurring, real-world risk coverage, and cultural expertise.

Key takeaway

For NLP engineers deploying LLMs in Chinese-language settings, English-centric safety systems alone are insufficient. Integrate culturally grounded benchmarks such as ChiSafe-PAS to identify and mitigate Chinese-specific evasion techniques, such as Pinyin romanization or internet slang, so that safety alignment holds up in high-stakes domains.

Key insights

Chinese LLM safety requires culturally specific benchmarks to counter linguistic and cultural evasion techniques.

Principles

English LLM safety systems often fail across linguistic and cultural boundaries.
Chinese-specific evasion techniques include Pinyin, character decomposition, slang, and hedging tone.
Culturally grounded resources are crucial for robust LLM safety alignment.
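
As a rough illustration of why script-level evasion matters, the sketch below flags prompts that mix Chinese characters with runs of Latin letters, one possible surface signal of Pinyin romanization. This heuristic is an assumption for illustration only and is not taken from ChiSafe-PAS; the function name and the two-letter threshold are hypothetical, and it will also flag harmless text such as brand names or English loanwords.

```python
import re

# Basic CJK Unified Ideographs block; rarer extension blocks are ignored here.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")
# Two or more consecutive ASCII letters (hypothetical threshold).
LATIN_RUN = re.compile(r"[A-Za-z]{2,}")


def looks_mixed_script(prompt: str) -> bool:
    """Return True when a prompt contains both Chinese characters and a Latin-letter run."""
    return bool(CJK_CHAR.search(prompt)) and bool(LATIN_RUN.search(prompt))


if __name__ == "__main__":
    # Benign example: a question about the meaning of "ni hao".
    print(looks_mixed_script("请问 ni hao 是什么意思"))
    print(looks_mixed_script("你好"))
```

A flag from a check like this would only be a cue for closer review. Techniques such as character decomposition, slang, or hedging tone leave no such script-level trace, which is part of the case for human-annotated data.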

Method

The source describes in detail the dataset design, the annotation process, and a nine-category obfuscation taxonomy for ChiSafe-PAS.
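
One way to picture a single annotated item is the minimal sketch below. Only the three response labels come from the summary; the class and field names are hypothetical, the nine obfuscation categories and the risk scale are not listed here and are therefore left as free-form strings, and the released data may be organised quite differently.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ResponseLabel(Enum):
    # The three response classes named in the summary.
    REFUSE = "REFUSE"
    SAFE_REDIRECT = "SAFE-REDIRECT"
    RESPOND = "RESPOND"


@dataclass
class AnnotatedPrompt:
    # Hypothetical field names; not the benchmark's actual schema.
    prompt: str
    domain: str
    label: Optional[ResponseLabel] = None
    obfuscation: Optional[str] = None  # one of nine categories (names not given here)
    risk_level: Optional[str] = None   # rating scale not specified in the summary
    rationale: Optional[str] = None    # annotator's free-text justification

    def is_fully_annotated(self) -> bool:
        """True when every annotation field has been filled in."""
        fields = (self.label, self.obfuscation, self.risk_level, self.rationale)
        return all(value is not None for value in fields)
```

Making the annotation fields optional reflects the summary's note that only 1,544 of the 1,897 prompts are fully annotated.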

In practice

Benchmark LLM safety alignment using culturally grounded Chinese datasets.
Identify and address Chinese-specific linguistic evasion techniques.
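
A benchmarking pass along these lines could be sketched as follows, assuming each model response has already been mapped to one of the three labels (by a human or a classifier, a step not shown here). The function name and the (domain, gold label, observed label) row format are assumptions for illustration, not an evaluation protocol taken from the source.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Row = Tuple[str, str, str]  # (domain, gold_label, observed_label)


def agreement_by_domain(rows: Iterable[Row]) -> Dict[str, float]:
    """Fraction of prompts per domain where the observed label matches the gold label."""
    totals = defaultdict(int)
    matches = defaultdict(int)
    for domain, gold, observed in rows:
        totals[domain] += 1
        if gold == observed:
            matches[domain] += 1
    return {domain: matches[domain] / totals[domain] for domain in totals}


if __name__ == "__main__":
    sample = [
        ("fraud", "REFUSE", "REFUSE"),
        ("fraud", "REFUSE", "RESPOND"),
        ("satire", "RESPOND", "RESPOND"),
    ]
    print(agreement_by_domain(sample))
```

Reporting agreement per domain rather than as one overall number keeps a weak domain from being masked by strong ones.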

Topics

LLM Safety Evaluation
Chinese Language Models
Adversarial Prompts
Human Annotation
Cross-cultural Safety
Linguistic Evasion

Best for: Research Scientist, AI Engineer, Machine Learning Engineer, AI Scientist, NLP Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.