CHILLGuard: Towards Fine-Grained Chinese LLM Safety Guardrail with Scalable Data Construction and Model-aware Preference Alignment

2026-06-13 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, quick

Summary

CHILLGuard is a content safety guardrail designed for Chinese large language models (LLMs). It targets the gaps that existing systems show when handling Chinese regulatory policies, cultural contexts, and linguistic nuances, and it introduces a fine-grained risk taxonomy with 5 macro and 31 micro categories. To overcome data scarcity, CHILLGuard uses a scalable multi-stage data construction pipeline that expands corpora via retrieval-augmented generation (RAG), creates implicit harmful samples through prompt engineering, and refines data quality using multi-model voting. This pipeline produced CHILLGuardTrain (405,007 samples) and CHILLGuardTest (51,745 samples). Trained under a generator-classifier collaborative framework with Model-aware Direct Preference Optimization, CHILLGuard reports a 15.92% F1 score improvement over Qwen3Guard-8B-Strict on its benchmark.
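
For orientation, the standard Direct Preference Optimization objective is written out below. This is the generic DPO loss only; the summary does not describe how CHILLGuard's model-aware variant modifies it, so nothing here should be read as the authors' formulation. The symbols are the usual ones and are assumptions relative to this summary: \pi_\theta is the model being trained, \pi_{\mathrm{ref}} is a frozen reference model, (x, y_w, y_l) is a prompt with a preferred and a rejected response, \beta is a scaling hyperparameter, and \sigma is the logistic function.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```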

Key takeaway

For AI/NLP engineers deploying LLMs in Chinese markets, existing safety guardrails often fall short on Chinese regulatory, cultural, and linguistic requirements. CHILLGuard's fine-grained risk taxonomy and scalable data construction pipeline target that gap for content moderation. Consider adopting its methodology or exploring the released resources to improve the safety and compliance of your Chinese LLM applications.

Key insights

CHILLGuard offers a fine-grained Chinese LLM safety guardrail via scalable data construction and model-aware preference alignment.

Principles

Fine-grained taxonomy improves safety adaptation.
Scalable data generation addresses data scarcity.
Multi-model voting refines data quality.

Method

A multi-stage pipeline expands the corpus via retrieval-augmented generation (RAG), generates implicit harmful samples via prompt engineering, and refines the data through multi-model voting for label calibration.
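
The voting step can be illustrated with a minimal sketch. It assumes each sample has already been labeled by several annotator models and keeps a label only when a clear majority agrees. The function names, the `min_agreement` threshold, and the "send to review" fallback are hypothetical choices for illustration, not details taken from CHILLGuard.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def vote_label(labels: Sequence[str], min_agreement: float = 0.6) -> Optional[str]:
    """Return the majority label if enough models agree, otherwise None.

    `min_agreement` is a hypothetical threshold, not a value from the paper.
    """
    if not labels:
        return None
    counts = Counter(labels)
    top_count = max(counts.values())
    # Labels sharing the highest count; more than one means a tie.
    winners = [label for label, count in counts.items() if count == top_count]
    if len(winners) > 1:
        return None
    if top_count / len(labels) < min_agreement:
        return None
    return winners[0]


def calibrate(
    samples: Sequence[Tuple[str, Sequence[str]]], min_agreement: float = 0.6
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Split samples into (text, label) pairs to keep and texts needing review."""
    kept: List[Tuple[str, str]] = []
    review: List[str] = []
    for text, labels in samples:
        label = vote_label(labels, min_agreement)
        if label is None:
            review.append(text)
        else:
            kept.append((text, label))
    return kept, review
```

Routing low-agreement samples to a review queue rather than dropping them is one possible design; discarding them or re-querying stronger models would fit the same interface.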

In practice

Implement a risk taxonomy with 5 macro and 31 micro categories.
Use RAG for corpus expansion.
Apply multi-model voting for data labeling.
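
As a rough illustration of the RAG step in the list above, the sketch below retrieves passages related to a seed sample and assembles a prompt for a generator model. Everything in it is an assumption for illustration: the character-bigram Jaccard retriever stands in for whatever retriever the authors use, the prompt wording and category string are placeholders, and the call to the generator is left out.

```python
from typing import List, Sequence, Set


def char_bigrams(text: str) -> Set[str]:
    """Character bigrams: a crude feature set that needs no word segmentation."""
    compact = "".join(text.split())
    if len(compact) < 2:
        return {compact} if compact else set()
    return {compact[i:i + 2] for i in range(len(compact) - 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets; 0.0 if either is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve(query: str, corpus: Sequence[str], k: int = 2) -> List[str]:
    """Return up to k corpus passages with non-zero overlap, best first."""
    query_grams = char_bigrams(query)
    scored = [(jaccard(query_grams, char_bigrams(doc)), doc) for doc in corpus]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


def build_expansion_prompt(seed: str, category: str, passages: Sequence[str]) -> str:
    """Assemble a hypothetical prompt asking a generator for a new sample."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        f"Risk category: {category}\n"
        f"Reference passages:\n{context}\n"
        f"Seed sample: {seed}\n"
        "Write one new sample in the same category, grounded in the references."
    )


corpus = ["网络诈骗的常见手法", "今天天气很好", "如何识别电信诈骗"]
passages = retrieve("诈骗手法", corpus, k=2)
prompt = build_expansion_prompt("诈骗手法", "placeholder-micro-category", passages)
```

In a real pipeline the retriever would more likely be an embedding or BM25 index, and the generated samples would then pass through the voting step before entering the training set.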

Topics

LLM Safety
Chinese LLMs
Content Moderation
Data Generation
Preference Alignment
Guardrails

Code references

cswbyu/CHILLGuard

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.