ChLogic: Evaluating Robustness of Logical Reasoning in Chinese Expressions
Summary
ChLogic is a new English-Chinese aligned benchmark designed to evaluate the robustness of large language models' logical reasoning capabilities across languages. It assesses whether models maintain performance when the same underlying logical structure is expressed in English and various Chinese surface realizations. Constructed from formal logical templates, ChLogic comprises three datasets: a General aligned set derived from 60 General Propositions across nine template families, a Difficult aligned set from 40 Difficult Problems, and a Chinese-only set covering 15 language-specific phenomenon types. Each aligned item pairs one English reference expression with five distinct Chinese realizations. Experiments conducted on Qwen3, Ministral, and GLM models revealed a consistent English-Chinese performance gap. Back-translation from standard Chinese to English often improved results on the General aligned set but showed mixed effects on the Difficult aligned set, with Qwen3-32B and GLM-5.1 performing worse. These findings indicate that Chinese surface realization, translation artifacts, and model-specific behavior collectively influence multilingual logical reasoning.
Key takeaway
NLP engineers deploying large language models in multilingual contexts, particularly for Chinese logical reasoning, should account for the persistent English-Chinese performance gap. Include stress tests such as ChLogic in your evaluation to identify how Chinese surface realizations and translation artifacts affect reasoning robustness. Be cautious with back-translation: it often helped on the General aligned set, but on the Difficult aligned set Qwen3-32B and GLM-5.1 performed worse with it. Validate each model separately rather than assuming a strategy transfers.
Key insights
ChLogic reveals a persistent English-Chinese logical reasoning gap in LLMs, influenced by Chinese surface forms and translation artifacts.
Principles
- LLM logical reasoning is not robust across languages: the same logical structure can score differently in English and in Chinese.
- Chinese surface forms affect reasoning performance.
- Translation artifacts introduce performance variability.
Method
ChLogic is constructed from formal logical templates. In the General and Difficult aligned sets, each item pairs one English reference expression with five distinct Chinese realizations of the same logical structure. A separate Chinese-only set covers 15 language-specific phenomenon types.
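To make the aligned-item structure concrete, here is a minimal Python sketch of how one such item might be represented and expanded into prompts. The class name `AlignedItem`, its fields, the `prompts` helper, and the example sentences are all hypothetical illustrations, not the benchmark's actual schema or data; only the "one English reference, five distinct Chinese realizations" shape comes from the summary above.

```python
from dataclasses import dataclass
from typing import Iterator, Tuple

# The summary states five Chinese realizations per aligned item.
N_CHINESE = 5


@dataclass(frozen=True)
class AlignedItem:
    """One aligned item (hypothetical schema, for illustration only)."""
    template: str              # assumed label for the formal logical template
    english: str               # the single English reference expression
    chinese: Tuple[str, ...]   # the Chinese surface realizations

    def __post_init__(self) -> None:
        # Enforce the shape described in the summary: exactly five, all distinct.
        if len(self.chinese) != N_CHINESE:
            raise ValueError(f"expected {N_CHINESE} Chinese realizations")
        if len(set(self.chinese)) != N_CHINESE:
            raise ValueError("Chinese realizations must be distinct")


def prompts(item: AlignedItem) -> Iterator[Tuple[str, str]]:
    """Yield (variant_id, text) pairs: 'en' first, then 'zh-1' to 'zh-5'."""
    yield "en", item.english
    for i, text in enumerate(item.chinese, start=1):
        yield f"zh-{i}", text


# Invented example of a conditional statement; not taken from ChLogic.
example = AlignedItem(
    template="conditional",
    english="If it rains, the ground gets wet.",
    chinese=(
        "如果下雨，那么地面会湿。",
        "只要下雨，地面就会湿。",
        "倘若下雨，则地面会湿。",
        "下雨的话，地面会湿。",
        "一旦下雨，地面就会湿。",
    ),
)

variants = list(prompts(example))
```

Keeping all six variants under one item makes it straightforward to compare a model's answers on the same logical structure across surface forms.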
In practice
- Use ChLogic to stress test multilingual LLMs.
- Evaluate back-translation impact on specific models.
- Analyze Chinese surface forms for reasoning failures.
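The checks above could be scripted along the following lines. This is a rough sketch under stated assumptions: the record format `(model, condition, correct)`, the condition labels `"en"`, `"zh"`, and `"bt"` (back-translated), the function names, and the toy records are all hypothetical, and the numbers are invented for illustration rather than taken from the paper.

```python
from collections import defaultdict


def accuracy_by_condition(records):
    """Map (model, condition) to accuracy, given (model, condition, correct) tuples."""
    totals = defaultdict(lambda: [0, 0])  # (model, condition) -> [n_correct, n_seen]
    for model, condition, correct in records:
        cell = totals[(model, condition)]
        cell[0] += int(correct)
        cell[1] += 1
    return {key: n_correct / n_seen for key, (n_correct, n_seen) in totals.items()}


def robustness_report(records):
    """Per model: English-minus-Chinese gap and back-translation-minus-Chinese delta."""
    acc = accuracy_by_condition(records)
    report = {}
    for model in sorted({m for m, _ in acc}):
        en = acc.get((model, "en"))
        zh = acc.get((model, "zh"))
        bt = acc.get((model, "bt"))
        report[model] = {
            # Positive gap: the model does better on English than on Chinese.
            "en_zh_gap": None if en is None or zh is None else en - zh,
            # Negative delta: back-translation hurt relative to the Chinese input.
            "bt_delta": None if bt is None or zh is None else bt - zh,
        }
    return report


# Invented toy records with placeholder model names; not real results.
toy_records = [
    ("model-a", "en", True), ("model-a", "en", True),
    ("model-a", "zh", True), ("model-a", "zh", False),
    ("model-a", "bt", True), ("model-a", "bt", True),
    ("model-b", "en", True), ("model-b", "en", True),
    ("model-b", "zh", True), ("model-b", "zh", False),
    ("model-b", "bt", False), ("model-b", "bt", False),
]

report = robustness_report(toy_records)
```

Reporting the back-translation delta per model, rather than averaged over models, matters here because the summary notes that back-translation helped some models and hurt others.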
Topics
- Large Language Models
- Multilingual NLP
- Logical Reasoning
- ChLogic Benchmark
- Chinese Language Processing
- Model Robustness
- Back-translation
Best for: Research Scientist, AI Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.