TukaBench: A Culturally Grounded Jailbreak Benchmark for African Languages
Summary
TukaBench is a jailbreak benchmark for evaluating Large Language Model (LLM) safety across seven African languages, addressing the English-centric bias of current safety evaluations. It extends JailbreakBench (JBB) with four prompt settings: direct human translations of JBB prompts; English prompts adapted to African contexts and then human-translated; human-curated prompts validated through interactions with GPT-5.2; and code-switched prompts that mix English with African languages. Evaluations of both closed and open models showed that prompting in African languages generally lowers refusal rates relative to English, with culturally adapted prompts producing the lowest refusal rates. The study also identified two structural limitations: LLMs often fail to comprehend the prompts, and LLM-as-a-judge becomes less reliable in Low-Resource Languages (LRLs). To address these, TukaBench introduces "Deflection" as a new metric alongside "Refused" and "Jailbroken", and validates judge outputs against human annotations. These annotations show that judge-human agreement drops for LRLs and for less commonly supported scripts.
Key takeaway
If you are an NLP or AI security engineer deploying LLMs globally, especially in African-language markets, English-centric safety benchmarks are not enough on their own. Your evaluations must include culturally grounded, language-specific prompts, because these substantially change refusal rates and expose comprehension failures that English prompts do not. Prioritize human validation of LLM-as-a-judge outputs in Low-Resource Languages so that safety assessments stay accurate. Track the "Deflection" metric to catch subtle failures that a simple refusal count misses.
Key insights
Prompting in African languages lowers LLM refusal rates, but it also exposes comprehension failures and unreliable LLM judging in low-resource settings.
Principles
- LLM safety evaluations are heavily English-centric.
- Culturally adapted prompts yield the lowest LLM refusal rates.
- LLM-as-a-judge reliability decreases in Low-Resource Languages.
Method
TukaBench extends JailbreakBench with four prompt settings:
- human translation of JBB prompts;
- English prompts culturally adapted and then translated;
- human-curated prompts validated with GPT-5.2;
- code-switched prompts mixing English with African languages.

It introduces "Deflection", alongside "Refused" and "Jailbroken", to capture comprehension failures. It also uses human annotations to validate the reliability of the LLM judge.
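A small tally can make the three-way label scheme concrete. The sketch below is a minimal illustration. It assumes each judged response is recorded as a (prompt setting, label) pair. The enum names, `label_rates`, and the toy records are hypothetical; they are not TukaBench's own code or data.

```python
from collections import Counter, defaultdict
from enum import Enum


class PromptSetting(Enum):
    # Hypothetical identifiers for the four prompt settings.
    TRANSLATED = "human-translated JBB prompt"
    CULTURALLY_ADAPTED = "adapted to an African context, then translated"
    HUMAN_CURATED = "human-curated, validated with GPT-5.2"
    CODE_SWITCHED = "English mixed with an African language"


class Label(Enum):
    # Three-way outcome: Deflection is tracked alongside Refused and Jailbroken.
    REFUSED = "refused"
    JAILBROKEN = "jailbroken"
    DEFLECTION = "deflection"


def label_rates(records):
    """Map each prompt setting to the fraction of its responses carrying each label.

    `records` is an iterable of (PromptSetting, Label) pairs, one per judged response.
    Settings with no records are omitted from the result.
    """
    counts = defaultdict(Counter)
    for setting, label in records:
        counts[setting][label] += 1
    rates = {}
    for setting, counter in counts.items():
        total = sum(counter.values())
        # Counter returns 0 for unseen labels, so every label gets a rate.
        rates[setting] = {label: counter[label] / total for label in Label}
    return rates


if __name__ == "__main__":
    # Toy records for illustration only; these are not TukaBench results.
    toy = [
        (PromptSetting.TRANSLATED, Label.REFUSED),
        (PromptSetting.TRANSLATED, Label.DEFLECTION),
        (PromptSetting.CULTURALLY_ADAPTED, Label.JAILBROKEN),
        (PromptSetting.CULTURALLY_ADAPTED, Label.REFUSED),
    ]
    rates = label_rates(toy)
    print(rates[PromptSetting.TRANSLATED][Label.REFUSED])  # 0.5
```

Reporting rates per setting keeps Deflection visible as its own outcome. A single refusal percentage would not show this. Without the separate label, a drop in refusals caused by the model misunderstanding the prompt could easily be read as a rise in jailbreaks.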
In practice
- Employ culturally adapted prompts for LLM safety evaluations in LRLs.
- Integrate human validation for LLM-as-a-judge outputs in LRLs.
- Monitor "Deflection" as a metric for LLM comprehension failures.
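Validating an LLM judge against human annotations largely comes down to measuring agreement per language. The source does not say which agreement statistic TukaBench reports. This sketch assumes raw agreement plus Cohen's kappa, a standard chance-corrected measure. The function names, label strings, and placeholder language codes are hypothetical.

```python
from collections import Counter, defaultdict


def cohens_kappa(judge, human):
    """Chance-corrected agreement between two equal-length label sequences."""
    if not judge or len(judge) != len(human):
        raise ValueError("expected two non-empty label lists of equal length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    judge_freq, human_freq = Counter(judge), Counter(human)
    # Agreement expected by chance if each rater labelled independently
    # with its own marginal label frequencies.
    expected = sum(judge_freq[c] * human_freq[c] for c in judge_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)


def agreement_by_language(rows):
    """Group (language, judge_label, human_label) rows and score each language.

    Assumption: raw agreement plus Cohen's kappa; the source does not name the statistic.
    """
    grouped = defaultdict(lambda: ([], []))
    for language, judge_label, human_label in rows:
        grouped[language][0].append(judge_label)
        grouped[language][1].append(human_label)
    report = {}
    for language, (judge, human) in grouped.items():
        raw = sum(j == h for j, h in zip(judge, human)) / len(judge)
        report[language] = {"raw": raw, "kappa": cohens_kappa(judge, human)}
    return report


if __name__ == "__main__":
    # Placeholder language codes and labels; illustrative, not real annotations.
    rows = [
        ("lang_a", "refused", "refused"),
        ("lang_a", "jailbroken", "jailbroken"),
        ("lang_a", "refused", "deflection"),
        ("lang_a", "deflection", "deflection"),
        ("lang_b", "refused", "refused"),
        ("lang_b", "jailbroken", "jailbroken"),
    ]
    for language, scores in agreement_by_language(rows).items():
        print(language, round(scores["raw"], 3), round(scores["kappa"], 3))
```

Scoring each language (or script) separately lets a drop in agreement for a particular LRL show up. A single pooled figure would average it away.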
Topics
- LLM Safety Evaluation
- African Languages
- Jailbreak Benchmarks
- Low-Resource Languages
- Prompt Engineering
- Model Comprehension
Best for: Research Scientist, AI Scientist, NLP Engineer, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.