ChLogic: Evaluating Robustness of Logical Reasoning in Chinese Expressions

2026-06-16 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, quick

Summary

ChLogic is a new English-Chinese aligned benchmark for evaluating how robust large language models' logical reasoning is across languages. It tests whether a model maintains performance when the same underlying logical structure is expressed in English and in various Chinese surface realizations. Constructed from formal logical templates, ChLogic comprises three datasets: a General aligned set derived from 60 General Propositions across nine template families, a Difficult aligned set derived from 40 Difficult Problems, and a Chinese-only set covering 15 language-specific phenomenon types. Each aligned item pairs one English reference expression with five distinct Chinese realizations. Experiments on Qwen3, Ministral, and GLM models revealed a consistent English-Chinese performance gap. Back-translating standard Chinese into English often improved results on the General aligned set but had mixed effects on the Difficult aligned set, where Qwen3-32B and GLM-5.1 performed worse after back-translation. These findings indicate that Chinese surface realization, translation artifacts, and model-specific behavior jointly influence multilingual logical reasoning.

Key takeaway

NLP engineers deploying large language models in multilingual contexts, particularly for Chinese logical reasoning, should account for the persistent English-Chinese performance gap. Include stress tests like ChLogic in your evaluation to identify how Chinese surface realizations and translation artifacts affect reasoning robustness. Be cautious with back-translation: on the Difficult aligned set it degraded performance for Qwen3-32B and GLM-5.1. Validate each model separately rather than assuming one mitigation transfers.

Key insights

ChLogic reveals a persistent English-Chinese logical reasoning gap in LLMs, influenced by Chinese surface forms and translation artifacts.

Principles

LLM logical reasoning lacks multilingual robustness.
Chinese surface forms affect reasoning performance.
Translation artifacts introduce performance variability.

Method

ChLogic builds an English-Chinese aligned benchmark from formal logical templates. In the General and Difficult aligned sets, each item pairs one English reference expression with five distinct Chinese realizations; a separate Chinese-only set covers 15 language-specific phenomenon types.
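
To make the aligned-item structure concrete, here is a minimal Python sketch of how one such item could be represented. It is not the benchmark's actual schema: the class name, field names, and the `prompts` helper are hypothetical, and the example sentences are invented for illustration rather than drawn from ChLogic.

```python
from dataclasses import dataclass
from typing import Iterator, Tuple


@dataclass(frozen=True)
class AlignedItem:
    """One aligned item: an English reference plus five Chinese realizations.

    All field names are hypothetical; ChLogic's real file format may differ.
    """
    template_id: str           # which formal logical template produced the item
    english: str               # the single English reference expression
    chinese: Tuple[str, ...]   # the Chinese surface realizations
    entailed: bool             # gold label shared by every realization

    def __post_init__(self) -> None:
        # The summary states that each aligned item has exactly five
        # Chinese realizations, so reject anything else.
        if len(self.chinese) != 5:
            raise ValueError("expected exactly five Chinese realizations")


def prompts(item: AlignedItem) -> Iterator[Tuple[str, str]]:
    """Yield (condition tag, text) pairs: English first, then each Chinese form."""
    yield ("en", item.english)
    for i, text in enumerate(item.chinese, start=1):
        yield (f"zh-{i}", text)


# Invented example of one conditional expressed six ways (not ChLogic data).
example = AlignedItem(
    template_id="conditional-demo",
    english="If it rains, the ground gets wet.",
    chinese=(
        "如果下雨，地面就会湿。",
        "只要下雨，地面就湿。",
        "一旦下雨，地面便会湿。",
        "下雨的话，地面会湿。",
        "倘若下雨，则地面会湿。",
    ),
    entailed=True,
)
```

Because all six expressions share one gold label, any difference in a model's answers across them can be attributed to surface form rather than to the logic itself.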

In practice

Use ChLogic to stress test multilingual LLMs.
Evaluate back-translation impact on specific models.
Analyze Chinese surface forms for reasoning failures.
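
The three steps above can be sketched as a small scoring routine. This is an illustrative outline only: the record layout (`item_id`, `condition`, `correct`) and the condition names are assumptions, and the toy records are made up, not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def accuracy_by_condition(records: Iterable[dict]) -> Dict[str, float]:
    """Mean accuracy for each condition (e.g. English, Chinese, back-translated)."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for r in records:
        totals[r["condition"]] += 1
        hits[r["condition"]] += int(r["correct"])
    return {c: hits[c] / totals[c] for c in totals}


def consistent_items(records: Iterable[dict]) -> List[str]:
    """Item ids answered correctly under every condition they appear in."""
    by_item: Dict[str, List[bool]] = defaultdict(list)
    for r in records:
        by_item[r["item_id"]].append(bool(r["correct"]))
    return sorted(i for i, flags in by_item.items() if all(flags))


# Made-up per-prompt outcomes for two items under three conditions.
toy_records = [
    {"item_id": "a", "condition": "en", "correct": True},
    {"item_id": "a", "condition": "zh", "correct": True},
    {"item_id": "a", "condition": "zh_backtranslated", "correct": True},
    {"item_id": "b", "condition": "en", "correct": True},
    {"item_id": "b", "condition": "zh", "correct": False},
    {"item_id": "b", "condition": "zh_backtranslated", "correct": True},
]

acc = accuracy_by_condition(toy_records)
gap = acc["en"] - acc["zh"]                           # English-Chinese gap
bt_effect = acc["zh_backtranslated"] - acc["zh"]      # > 0 means back-translation helped
robust = consistent_items(toy_records)                # items stable across all forms
```

Run this per model and per dataset: the summary reports that back-translation helped on the General aligned set but hurt some models on the Difficult aligned set, so a single pooled `bt_effect` could hide the sign flip. Items missing from `robust` are natural starting points for inspecting which Chinese surface forms trigger failures.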

Topics

Large Language Models
Multilingual NLP
Logical Reasoning
ChLogic Benchmark
Chinese Language Processing
Model Robustness
Back-translation

Best for: Research Scientist, AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.