TEMPER: Testing Emotional Perturbation in Quantitative Reasoning

· Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

A study by Dokme, Reichman, and Heck of the Georgia Institute of Technology introduces TEMPER, a framework and benchmark for evaluating how emotional framing affects the quantitative reasoning of large language models (LLMs). The researchers developed a controlled emotion translation framework that rewrites math problems into emotional variants while preserving all numerical content and relationships. Using it, they constructed Temper-5400, a dataset of 5,400 semantically verified emotion–neutral pairs drawn from GSM8K, MultiArith, and ARC-Challenge. Across eighteen models, from 1B parameters to frontier scale, emotional framing alone reduced accuracy by 2–10 percentage points. Crucially, neutralizing the emotional variants recovered most of the lost performance, indicating that the degradation is tied to emotional style rather than to corrupted content. Non-emotional paraphrases caused no such degradation, which points to emotional framing specifically rather than rewording in general. The benchmark construction method also offers a general framework for controlled stylistic translation and robustness evaluation.
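
The two headline quantities in this summary, the accuracy drop in percentage points and the share of that drop recovered by neutralization, can be computed as in the minimal sketch below. The `Item` record, the function names, and the toy answers are illustrative assumptions, not part of the paper's released code or data.

```python
from dataclasses import dataclass


@dataclass
class Item:
    gold: str         # reference answer
    neutral: str      # model answer on the original (neutral) problem
    emotional: str    # model answer on the emotional rewrite
    neutralized: str  # model answer after the rewrite is neutralized again


def accuracy(items, field):
    """Fraction of items whose answer in `field` matches the gold answer."""
    return sum(getattr(it, field) == it.gold for it in items) / len(items)


def robustness_report(items):
    """Accuracy drop (percentage points) and the fraction of it recovered."""
    base = accuracy(items, "neutral")
    emo = accuracy(items, "emotional")
    rec = accuracy(items, "neutralized")
    drop_pp = 100 * (base - emo)
    # Recovery is only defined when emotional framing actually hurt accuracy.
    recovered = (rec - emo) / (base - emo) if base > emo else float("nan")
    return {"drop_pp": drop_pp, "recovered_fraction": recovered}


# Toy, made-up answers purely to exercise the functions (not study data).
items = [
    Item("12", "12", "12", "12"),
    Item("7", "7", "9", "7"),
    Item("30", "30", "30", "30"),
    Item("5", "5", "4", "4"),
]
report = robustness_report(items)
```

Reporting the drop in percentage points rather than as a relative percentage keeps results comparable across models with very different baseline accuracy.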

Key takeaway

AI engineers and research scientists deploying LLMs in user-facing applications should consider inference-time mitigation that neutralizes emotional input. The observed drop of 2–10 percentage points in quantitative reasoning accuracy under emotional framing, present even in frontier models, is a significant robustness vulnerability. Pre-processing user queries into a neutral tone can recover much of the lost performance, especially for tasks that require precise numerical reasoning. Prioritize testing against negative emotions such as "disgust" and "fear", which the study reports as the most disruptive.
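
One way to apply this takeaway is a thin wrapper that neutralizes a query before solving it and falls back to the raw query if the rewrite altered any number. This is a sketch under stated assumptions: `neutralize` and `solve` are hypothetical callables (for example, two LLM calls), and the digit-only check is a simple stand-in for the semantic verification described in the summary, not the authors' method.

```python
import re
from collections import Counter
from typing import Callable

NUMBER = re.compile(r"\d+(?:\.\d+)?")
THOUSANDS_COMMA = re.compile(r"(?<=\d),(?=\d{3}(?!\d))")


def numbers_in(text: str) -> Counter:
    """Multiset of numeric literals (digits only; number words are ignored)."""
    return Counter(NUMBER.findall(THOUSANDS_COMMA.sub("", text)))


def solve_with_neutralization(
    query: str,
    neutralize: Callable[[str], str],  # hypothetical: rewrites to a neutral tone
    solve: Callable[[str], str],       # hypothetical: answers the problem
) -> str:
    candidate = neutralize(query)
    # Fall back to the raw query if the rewrite changed any numeric literal.
    if numbers_in(candidate) != numbers_in(query):
        candidate = query
    return solve(candidate)


# Stub callables standing in for real model calls.
query = "I'm terrified I'll fail! I have 12 apples and give away 5. How many are left?"
faithful = lambda q: "I have 12 apples and give away 5. How many are left?"
corrupting = lambda q: "I have 12 apples and give away 6. How many are left?"
echo = lambda q: q  # returns the text it was asked to solve

used_when_faithful = solve_with_neutralization(query, faithful, echo)
used_when_corrupting = solve_with_neutralization(query, corrupting, echo)
```

The fallback matters because a neutralizer that silently changes a quantity would trade a style problem for a content problem, which is exactly the confound the benchmark is built to exclude.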

Key insights

Emotional framing significantly degrades LLM quantitative reasoning, a deficit largely recoverable by neutralizing emotional style.

Method

Emotion translators are trained in a teacher-student framework: Llama 3.1-8B is adapted with LoRA and guided by the latent space of a frozen DistilRoBERTa emotion classifier, so that emotional intensity can be controlled while the mathematical content is preserved.
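
The summary does not spell out how the classifier's latent space guides training. The sketch below shows one plausible reading only, and every detail in it is an assumption: a target latent is interpolated between a neutral centroid and an emotion centroid by an intensity `alpha`, and the student is penalized by the cosine distance between its output's latent and that target. The function names and the loss form are hypothetical and should not be read as the paper's objective.

```python
import math


def interpolate(neutral, emotion, alpha):
    """Target latent: alpha=0 is neutral, alpha=1 is the full emotion centroid."""
    return [(1 - alpha) * n + alpha * e for n, e in zip(neutral, emotion)]


def cosine_distance(u, v):
    """1 - cosine similarity; 0 means the two vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def guidance_loss(student_latent, neutral_centroid, emotion_centroid, alpha):
    """Assumed guidance term: distance from the student's latent to the target."""
    target = interpolate(neutral_centroid, emotion_centroid, alpha)
    return cosine_distance(student_latent, target)


# Two-dimensional toy latents; real classifier embeddings are much larger.
neutral_centroid = [1.0, 0.0]
fear_centroid = [0.0, 1.0]
loss_full = guidance_loss([0.0, 1.0], neutral_centroid, fear_centroid, alpha=1.0)
loss_half = guidance_loss([1.0, 1.0], neutral_centroid, fear_centroid, alpha=0.5)
loss_miss = guidance_loss([1.0, 0.0], neutral_centroid, fear_centroid, alpha=1.0)
```

In a real training loop this term would be computed on tensors from the frozen classifier and combined with a content-preservation objective; both parts are left out here because the summary gives no details on them.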

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.