Fragile Thoughts: How Large Language Models Handle Chain-of-Thought Perturbations

2026-04-21 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

An empirical evaluation measured the robustness of 13 Large Language Models (LLMs), ranging from 3B to 1.5T parameters, to five types of Chain-of-Thought (CoT) perturbation: MathError, UnitConversion, Sycophancy, SkippedSteps, and ExtraSteps. Vulnerability was heterogeneous. MathError caused the most severe degradation in small models (50-60% accuracy loss) but showed strong scaling benefits. UnitConversion remained challenging at every scale (20-30% loss even for the largest models), while ExtraSteps caused minimal degradation (0-6%). Sycophancy had modest effects (7% loss for small models), and SkippedSteps caused intermediate damage (15% loss). Scaling relationships followed power-law patterns, so model size protects against some perturbations but offers limited defense against errors in dimensional reasoning. The findings bear directly on deploying LLMs in multi-stage reasoning pipelines.

Key takeaway

AI architects and research scientists deploying LLMs in multi-stage reasoning systems should implement task-specific validation rather than rely on model scale alone. In particular, add error-checking for mathematical computations and external verification for dimensional reasoning, because even frontier models remain significantly vulnerable in these areas. Do not assume that an LLM will reliably self-correct errors or misinformation embedded in its reasoning chain.
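As an illustration of external verification for dimensional reasoning, the following is a minimal sketch, not anything taken from the study. The conversion table, the function names, and the tolerance are all hypothetical; a production pipeline would more likely rely on a dedicated units library.

```python
import math

# Hypothetical conversion table: unit -> (dimension, factor to the base unit).
UNITS = {
    "m": ("length", 1.0),
    "cm": ("length", 0.01),
    "km": ("length", 1000.0),
    "s": ("time", 1.0),
    "min": ("time", 60.0),
    "h": ("time", 3600.0),
}

def convert(value, src, dst):
    """Convert value from src to dst, refusing to mix dimensions."""
    src_dim, src_factor = UNITS[src]
    dst_dim, dst_factor = UNITS[dst]
    if src_dim != dst_dim:
        raise ValueError(f"cannot convert {src_dim} to {dst_dim}")
    return value * src_factor / dst_factor

def conversion_is_consistent(value, src, claimed, dst, rel_tol=1e-6):
    """Check a conversion claimed in a reasoning step against the table."""
    return math.isclose(convert(value, src, dst), claimed, rel_tol=rel_tol)
```

For example, a step claiming that 2.5 h equals 150 min passes the check, while a claim of 250 min fails and can be flagged before the chain continues.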

Key insights

LLM robustness to CoT perturbations varies significantly with perturbation type and model scale, challenging the assumption that size alone ensures reliability.

Principles

Larger LLMs exhibit greater resilience to arithmetic errors.
Dimensional reasoning remains a universal challenge for LLMs.
LLMs effectively filter redundant information in reasoning chains.

Method

The study systematically injected five types of perturbations into the last intermediate step of partial CoT solutions from the GSM8K dataset and measured accuracy degradation across 13 LLMs.
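The injection step can be pictured with a minimal sketch, assuming a partial solution is held as a list of step strings. This is not the study's code: the function names, the fixed offset, and the choice to alter the last number in the final step are hypothetical stand-ins for a MathError-style perturbation, and accuracy degradation is assumed here to be measured in percentage points.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")

def inject_math_error(steps, offset=10):
    """Return a copy of steps with the last number of the final step altered."""
    if not steps:
        raise ValueError("need at least one reasoning step")
    last = steps[-1]
    matches = list(NUMBER.finditer(last))
    if not matches:
        raise ValueError("last step contains no number to perturb")
    target = matches[-1]
    wrong = float(target.group()) + offset
    # Keep integers looking like integers so the step still reads naturally.
    wrong_text = str(int(wrong)) if wrong.is_integer() else str(wrong)
    perturbed = last[:target.start()] + wrong_text + last[target.end():]
    return steps[:-1] + [perturbed]

def accuracy_drop(clean_correct, perturbed_correct):
    """Accuracy degradation in percentage points between two runs."""
    if len(clean_correct) != len(perturbed_correct) or not clean_correct:
        raise ValueError("need two non-empty result lists of equal length")
    n = len(clean_correct)
    return 100.0 * (sum(clean_correct) - sum(perturbed_correct)) / n
```

The model would then be asked to continue from the perturbed partial solution, and its final answers compared with those obtained from the unperturbed one.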

In practice

Implement external numerical verification for LLM-driven math pipelines.
Avoid delegating dimensional tracking tasks to LLMs without external checks.
Consider verbose, step-by-step explanations as a potential mitigation: extra steps degraded accuracy only minimally (0-6%), whereas skipped steps cost about 15%.
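The first recommendation above can be sketched as a small checker that recomputes every simple equation it finds in a reasoning chain. This is an assumption-laden illustration rather than a method from the study: the function name, the regular expression, and the restriction to non-negative "a op b = c" patterns are all hypothetical simplifications.

```python
import math
import operator
import re

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}
NUM = r"(\d+(?:\.\d+)?)"
EQUATION = re.compile(NUM + r"\s*([+\-*/])\s*" + NUM + r"\s*=\s*" + NUM)

def find_arithmetic_errors(steps, rel_tol=1e-6):
    """Return (step index, equation text, recomputed value) for each bad equation."""
    errors = []
    for index, step in enumerate(steps):
        for match in EQUATION.finditer(step):
            left, op, right, claimed = match.groups()
            if op == "/" and float(right) == 0:
                continue  # skip division by zero instead of guessing a value
            expected = OPS[op](float(left), float(right))
            if not math.isclose(expected, float(claimed), rel_tol=rel_tol):
                errors.append((index, match.group(0), expected))
    return errors
```

Running it on a chain containing "48 + 24 = 62" reports that equation together with the recomputed value 72.0, so a pipeline can stop or repair the chain instead of passing the error to the next stage.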

Topics

Chain-of-Thought Prompting
LLM Robustness
CoT Perturbations
Mathematical Reasoning
Model Scaling Effects

Code references

Mystic-Slice/CoTPerturbation

Best for: AI Architect, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.