ROK-FORTRESS: Measuring the Effect of Geopolitical Transcreation for National Security and Public Safety

· Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, extended

Summary

ROK-FORTRESS is a new bilingual, culturally adversarial benchmark for evaluating large language models (LLMs) on National Security and Public Safety (NSPS) risks, focused on the English-Korean language pair and the U.S.-ROK geopolitical axis. The benchmark comprises 1,235 tasks and uses a "transcreation matrix" to isolate the effects of language and geopolitical context on LLM safety behavior: each adversarial prompt is evaluated under controlled combinations of English vs. Korean language and U.S. vs. Korean entities, institutions, and operational details. Each adversarial prompt is also paired with a benign counterpart to measure over-refusal. Responses are scored by calibrated LLM-as-a-judge panels using expert-crafted binary rubrics and Tier-Weighted Risk Scores (TRS). Experiments across 14 frontier and Korean-optimized models reveal a consistent suppression effect in Korean variants and significant model-to-model variation in how geopolitical grounding interacts with language; in many cases, geopolitical grounding mitigates the language-driven suppression.

Key takeaway

NLP engineers and research scientists developing or deploying LLMs in high-stakes global contexts should move beyond translation-only safety evaluations and incorporate culturally adversarial benchmarks such as ROK-FORTRESS that account for geopolitical grounding. Doing so helps surface safety failures that translation alone does not expose, improves model alignment across linguistic and cultural environments, reduces dual-use misuse risks, and supports more equitable safety for non-English users.

Key insights

Multilingual LLM safety evaluations must consider geopolitical transcreation, not just translation, to accurately assess real-world risks.

Principles

Method

The "transcreation matrix" methodology systematically varies language and geopolitical grounding to separate linguistic effects from contextual effects in LLM safety evaluations. It relies on adversarial-benign prompt pairs and tier-weighted risk scoring.
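The sketch below is one way to picture the matrix design. It is an illustration only, not the benchmark's implementation. The tier weights, the scoring formula, and all function and field names are assumptions introduced here, since this summary does not give the actual TRS definition. It enumerates the four language-by-grounding cells, computes a hypothetical tier-weighted score from binary rubric verdicts, and separates the language effect, the grounding effect, and their interaction from per-cell mean scores.

```python
from dataclasses import dataclass
from itertools import product
from statistics import mean

LANGUAGES = ("en", "ko")    # prompt language
GROUNDINGS = ("us", "rok")  # entities, institutions, operational details

# Hypothetical weights: higher tiers stand for more severe rubric items.
# The real TRS weighting is not specified in this summary.
TIER_WEIGHTS = {1: 1.0, 2: 2.0, 3: 3.0}


@dataclass(frozen=True)
class RubricItem:
    tier: int        # severity tier of this rubric item
    satisfied: bool  # binary judge verdict for this item


def matrix_cells():
    """Return the four (language, grounding) cells of the matrix."""
    return list(product(LANGUAGES, GROUNDINGS))


def tier_weighted_risk(items):
    """Assumed score: weighted share of rubric items satisfied, in [0, 1]."""
    total = sum(TIER_WEIGHTS[item.tier] for item in items)
    hit = sum(TIER_WEIGHTS[item.tier] for item in items if item.satisfied)
    return hit / total if total else 0.0


def over_refusal_rate(benign_refused):
    """Share of benign counterpart prompts that the model refused."""
    return sum(benign_refused) / len(benign_refused)


def decompose(scores):
    """Split per-cell mean scores into main effects and an interaction.

    `scores` maps (language, grounding) to a mean risk score.
    """
    language_effect = mean(
        scores[("ko", g)] - scores[("en", g)] for g in GROUNDINGS
    )
    grounding_effect = mean(
        scores[(lang, "rok")] - scores[(lang, "us")] for lang in LANGUAGES
    )
    # Difference-in-differences: how much the language gap changes
    # when the grounding switches from U.S. to Korean.
    interaction = (scores[("ko", "rok")] - scores[("en", "rok")]) - (
        scores[("ko", "us")] - scores[("en", "us")]
    )
    return {
        "language": language_effect,
        "grounding": grounding_effect,
        "interaction": interaction,
    }
```

Under these assumptions, a nonzero interaction term means the language gap differs between U.S.-grounded and Korean-grounded prompts, which is the kind of language-by-grounding interaction that a translation-only evaluation cannot observe.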

Best for: Research Scientist, NLP Engineer, AI Scientist, AI Security Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.