AtManRL: Towards Faithful Reasoning via Differentiable Attention Saliency

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, extended

Summary

AtManRL improves the faithfulness of Chain-of-Thought (CoT) reasoning in large language models (LLMs) by combining differentiable attention manipulation with reinforcement learning. The approach trains an additive attention mask, initialized to a negative constant of -0.4, to identify the CoT tokens that are crucial for generating the correct answer. The mask is optimized for 200 steps to restore the probability of the correct answer, and the optimized mask yields a saliency reward signal. This reward is combined with outcome-based rewards within the GRPO framework to jointly optimize answer correctness and reasoning interpretability. In experiments on 1,000 samples from the GSM8K and MMLU datasets with the Llama-3.2-3B-Instruct model, AtManRL reduced average reasoning length by 44% on GSM8K and 46% on MMLU while maintaining comparable task performance. The method also shifted token composition toward more information-dense content: on GSM8K, stop words decreased by 25%, while numbers increased by 48% and symbols by 61%.
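
To make the mask-optimization step concrete, here is a minimal toy sketch in pure Python. It is not the paper's implementation. Only the initial value of -0.4 and the 200 steps come from the summary; the single-query attention model, the scalar per-token "values", the unmasked prompt token, plain gradient descent, and the learning rate are all illustrative assumptions.

```python
import math

INIT_MASK = -0.4  # initial mask value reported in the summary
NUM_STEPS = 200   # optimization steps reported in the summary


def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def forward(mask, logits, values):
    """Toy forward pass (assumed model): one unmasked prompt token plus masked CoT tokens.

    Returns (answer probability, CoT attention weights, answer logit).
    """
    # Position 0 is a prompt token with logit 0 and value 0; it is never masked.
    attn = softmax([0.0] + [s + m for s, m in zip(logits, mask)])
    cot_attn = attn[1:]
    z = sum(a * v for a, v in zip(cot_attn, values))
    p = 1.0 / (1.0 + math.exp(-z))
    return p, cot_attn, z


def optimize_mask(logits, values, lr=0.5):
    """Gradient descent on -log p(answer) over the additive mask (lr is an assumption)."""
    mask = [INIT_MASK] * len(logits)
    for _ in range(NUM_STEPS):
        p, attn, z = forward(mask, logits, values)
        # Analytic gradient of -log(p) with respect to each mask entry:
        # dL/dz = p - 1 and dz/dm_i = a_i * (v_i - z).
        grads = [(p - 1.0) * a * (v - z) for a, v in zip(attn, values)]
        mask = [m - lr * g for m, g in zip(mask, grads)]
    return mask


# Hypothetical example: token 0 supports the answer most, token 1 works against it.
logits = [0.0, 0.0, 0.0, 0.0]
values = [3.0, -1.0, 0.0, 2.0]
mask = optimize_mask(logits, values)
for i, m in enumerate(mask):
    print(f"token {i}: mask {m:+.3f}")
```

In this toy setting, entries that rise above the initial value mark tokens that help restore the answer probability, and entries that fall below it mark tokens that do not. How the paper turns the optimized mask into a scalar saliency reward is not specified in this summary, so that step is left out.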

Key takeaway

For research scientists developing or fine-tuning LLMs, AtManRL offers a principled way to improve the faithfulness and efficiency of reasoning traces. Consider integrating differentiable attention manipulation and saliency-based rewards into your reinforcement learning pipelines to reduce extraneous reasoning content without sacrificing accuracy. This approach can lead to more transparent and cost-effective LLM deployments, particularly for tasks that require concise, causally linked explanations.

Key insights

AtManRL uses differentiable attention masks and RL to produce shorter LLM reasoning traces whose tokens matter more for the final answer.

Principles

Method

Initialize an additive attention mask with negative values to suppress CoT token influence, then optimize it to restore correct answer probability. Derive a saliency reward from the optimized mask and integrate it into GRPO-based reinforcement learning.
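
The reward-combination step can be sketched as follows. GRPO normalizes each sampled completion's reward against the mean and standard deviation of its group; that part is standard. The weighted-sum combination, the `weight` parameter, the use of the population standard deviation, and the function names are assumptions for illustration, since the summary only says the saliency reward is "combined" with outcome rewards.

```python
import statistics


def combined_rewards(outcome, saliency, weight=0.5):
    """Add a saliency term to each outcome reward (weighted sum is an assumption)."""
    return [o + weight * s for o, s in zip(outcome, saliency)]


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize rewards within one group of completions."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations vary
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of four completions sampled for the same prompt.
outcome = [1.0, 1.0, 0.0, 0.0]   # 1.0 if the final answer is correct
saliency = [0.8, 0.2, 0.6, 0.1]  # made-up saliency rewards from the mask step
rewards = combined_rewards(outcome, saliency)
advantages = group_relative_advantages(rewards)
print(advantages)
```

With this shape, two completions that both reach the correct answer still receive different advantages when one has a more salient reasoning trace, which is the lever that lets the policy prefer concise, answer-relevant reasoning.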

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.