TEMPER: Testing Emotional Perturbation in Quantitative Reasoning

2026-04-10 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

A study by Dokme, Reichman, and Heck from Georgia Institute of Technology introduces TEMPER, a framework and benchmark designed to evaluate how emotional framing affects the quantitative reasoning of large language models (LLMs). The researchers developed a controlled emotion translation framework that rewrites math problems into emotional variants while preserving all numerical content and relationships. Using this framework, they constructed Temper-5400, a dataset of 5,400 semantically verified emotion-neutral pairs across GSM8K, MultiArith, and ARC-Challenge. Evaluating eighteen models (from 1B to frontier scale), they found that emotional framing alone reduces accuracy by 2–10 percentage points. Crucially, neutralizing these emotional variants recovered most of the lost performance, indicating that the degradation is tied to emotional style, not content corruption. Non-emotional paraphrases caused no such degradation, further implicating the emotional framing itself rather than rewording in general. The benchmark construction method also offers a general framework for controlled stylistic translation and robustness evaluation.
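
The comparison described above reduces to two quantities: the accuracy lost under emotional framing, in percentage points, and the share of that loss regained once the emotional variants are neutralized. The following is a minimal Python sketch using made-up toy data; the function names and the recovery ratio are illustrative assumptions, not the paper's evaluation code.

```python
def accuracy(correct_flags):
    """Fraction of items answered correctly (0.0 to 1.0)."""
    if not correct_flags:
        raise ValueError("need at least one item")
    return sum(correct_flags) / len(correct_flags)


def emotional_drop_pp(neutral_flags, emotional_flags):
    """Accuracy lost under emotional framing, in percentage points."""
    return 100.0 * (accuracy(neutral_flags) - accuracy(emotional_flags))


def recovered_fraction(neutral_flags, emotional_flags, neutralized_flags):
    """Share of the emotional drop regained after neutralizing the variants.

    Returns None when emotional framing caused no drop, because the
    ratio is undefined in that case.
    """
    drop = accuracy(neutral_flags) - accuracy(emotional_flags)
    if drop <= 0:
        return None
    regained = accuracy(neutralized_flags) - accuracy(emotional_flags)
    return regained / drop


# Toy per-item correctness flags (invented for illustration only).
neutral = [True] * 9 + [False] * 1       # original neutral problems
emotional = [True] * 8 + [False] * 2     # emotional rewrites of the same problems
neutralized = [True] * 9 + [False] * 1   # emotional rewrites translated back to neutral

print(round(emotional_drop_pp(neutral, emotional), 1))
print(recovered_fraction(neutral, emotional, neutralized))
```

Reporting the drop in percentage points rather than as a relative percentage keeps it comparable across models with very different baseline accuracy.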

Key takeaway

AI engineers and research scientists deploying LLMs in user-facing applications should consider inference-time mitigation that neutralizes emotional input. The observed accuracy drop of 2 to 10 percentage points in quantitative reasoning under emotional framing, even in frontier models, points to a robustness vulnerability. Consider rewriting user queries into a neutral tone before inference to recover lost performance, especially for tasks requiring precise numerical reasoning, and prioritize testing against negative emotions such as "disgust" and "fear", which are reported to be the most disruptive.

Key insights

Emotional framing significantly degrades LLM quantitative reasoning, a deficit largely recoverable by neutralizing emotional style.

Principles

Emotional style, not content, drives reasoning degradation.
Larger models are more robust but not immune to emotional framing.
Disgust is the most disruptive emotion for quantitative reasoning.

Method

A teacher-student framework trains emotion translators using Llama 3.1-8B with LoRA, guided by a frozen DistilRoBERTa emotion classifier's latent space to control emotional intensity and preserve mathematical content.
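
One simple way to sanity-check that a rewrite preserves numerical content is to compare the numeric literals in the original and rewritten problem. The sketch below is an assumption for illustration, not the semantic verification used by the authors: the helper names are hypothetical, and it only sees digits, so it misses spelled-out numbers ("three") and cannot check relationships between quantities.

```python
import re
from collections import Counter

# Matches integers and simple decimals such as 12 or 1.50.
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def extract_numbers(text):
    """Return a multiset of the numeric literals in text, as floats."""
    # Drop thousands separators so "1,200" is read as 1200.
    cleaned = re.sub(r"(?<=\d),(?=\d{3})", "", text)
    return Counter(float(m) for m in _NUMBER.findall(cleaned))


def preserves_numbers(original, rewritten):
    """True if both texts contain the same numbers the same number of times."""
    return extract_numbers(original) == extract_numbers(rewritten)


# Invented example pair, not taken from Temper-5400.
original = "Tom has 3 apples and buys 12 more for $1.50 each."
emotional = "Ugh, Tom is stuck with 3 bruised apples and has to buy 12 more at $1.50 each."
corrupted = "Ugh, Tom is stuck with 3 bruised apples and has to buy 13 more at $1.50 each."

print(preserves_numbers(original, emotional))
print(preserves_numbers(original, corrupted))
```

A check like this is cheap enough to run on every generated variant before any heavier verification.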

In practice

Neutralize emotional user queries at inference time.
Prioritize robustness testing against "disgust" and "fear" emotional tones.
Use a 100-dimensional latent space for fine-grained emotion control.
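
The first recommendation above can be sketched as a thin wrapper around the model call. This is a hedged illustration only: neutralize, solve, and is_emotional are hypothetical callables supplied by the caller (for example an LLM rewrite prompt, the reasoning model, and an emotion classifier), and the fallback rule is an assumption rather than something the study prescribes.

```python
import re
from typing import Callable, Optional


def _numbers(text: str) -> list:
    """Sorted numeric literals in text, compared as strings (deliberately strict)."""
    return sorted(re.findall(r"\d+(?:\.\d+)?", text))


def answer_with_neutralization(
    query: str,
    neutralize: Callable[[str], str],
    solve: Callable[[str], str],
    is_emotional: Optional[Callable[[str], bool]] = None,
) -> str:
    """Rewrite an emotional query into a neutral tone before solving it.

    Falls back to the original query when the rewrite changes any number,
    so the mitigation cannot silently alter the problem being solved.
    """
    # Optional gate: skip the rewrite when the query is already neutral.
    if is_emotional is not None and not is_emotional(query):
        return solve(query)

    rewritten = neutralize(query)
    if _numbers(rewritten) != _numbers(query):
        return solve(query)  # rewrite altered numeric content; do not trust it
    return solve(rewritten)


# Usage with stand-in stubs; a real system would call models here.
def strip_tone(q: str) -> str:
    return q.replace("Ugh, this is awful. ", "")


def echo(q: str) -> str:
    return q  # stands in for the reasoning model; returns what it was asked


print(answer_with_neutralization("Ugh, this is awful. What is 3 + 4?", strip_tone, echo))
```

Keeping the rewrite and the solver as separate callables makes it possible to measure accuracy with and without the neutralization step on the same queries.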

Topics

Emotional Perturbation
Quantitative Reasoning
LLM Robustness
Temper-5400 Benchmark
Emotion Translation Framework

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.