AtManRL: Towards Faithful Reasoning via Differentiable Attention Saliency
Summary
AtManRL is a method for improving the faithfulness of Chain-of-Thought (CoT) reasoning in large language models (LLMs) by combining differentiable attention manipulation with reinforcement learning. The approach trains an additive attention mask, initialized to a negative constant of -0.4, to identify the CoT tokens that are crucial for generating the correct answer. The mask is optimized for 200 steps to restore the probability of the correct answer, and the optimized mask yields a saliency reward signal. This saliency reward is combined with outcome-based rewards within the GRPO framework to jointly optimize answer correctness and reasoning interpretability. In experiments on 1,000 samples from the GSM8K and MMLU datasets using the Llama-3.2-3B-Instruct model, AtManRL reduced average reasoning length by 44% on GSM8K and 46% on MMLU while maintaining comparable task performance. The method also shifted token composition toward more information-dense content: on GSM8K, stop words decreased by 25%, while numbers and symbols increased by 48% and 61%, respectively.
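To make the mask-optimization step concrete, here is a toy sketch in plain Python. It is not the authors' implementation. It assumes a single attention query over one unmasked prompt token plus a few CoT tokens, where each token has a scalar "value" standing in for how strongly it pushes the correct-answer logit. The function names, the learning rate, and the sigmoid readout are illustrative assumptions; only the -0.4 initialization and the 200 optimization steps come from the summary.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def answer_prob(cot_logits, cot_values, mask, prompt_logit=0.0, prompt_value=0.0):
    """Toy forward pass (assumption): one query, scalar values, sigmoid readout.

    The additive mask is applied to CoT token logits only; the prompt token
    at index 0 stays unmasked, so a negative mask shifts attention away
    from the CoT tokens.
    """
    logits = [prompt_logit] + [s + m for s, m in zip(cot_logits, mask)]
    values = [prompt_value] + list(cot_values)
    attn = softmax(logits)
    z = sum(a * v for a, v in zip(attn, values))  # toy correct-answer logit
    p = 1.0 / (1.0 + math.exp(-z))                # toy correct-answer probability
    return p, attn, z

def optimize_mask(cot_logits, cot_values, init=-0.4, steps=200, lr=0.5):
    """Gradient descent on -log p with respect to the additive mask."""
    mask = [init] * len(cot_logits)
    for _ in range(steps):
        p, attn, z = answer_prob(cot_logits, cot_values, mask)
        for i, v in enumerate(cot_values):
            # d(-log p) / d mask_i = -(1 - p) * attn_i * (v_i - z)
            grad = -(1.0 - p) * attn[i + 1] * (v - z)
            mask[i] -= lr * grad
    return mask

# Hypothetical CoT tokens: two helpful, one harmful, one neutral.
cot_logits = [0.0, 0.0, 0.0, 0.0]
cot_values = [3.0, -1.0, 0.0, 2.5]

p_before, _, _ = answer_prob(cot_logits, cot_values, [-0.4] * 4)
mask = optimize_mask(cot_logits, cot_values)
p_after, _, _ = answer_prob(cot_logits, cot_values, mask)
```

In this toy setting, the mask entry of the most helpful token rises above its initial value while the entry of the harmful token falls further, which is the sense in which the optimized mask ranks tokens by influence. How the paper converts the optimized mask into a scalar reward is not specified in this summary.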
Key takeaway
For research scientists developing or fine-tuning LLMs, AtManRL offers a principled way to improve reasoning trace faithfulness and efficiency. You should consider integrating differentiable attention manipulation and saliency-based rewards into your reinforcement learning pipelines to reduce extraneous reasoning content without sacrificing accuracy. This approach can lead to more transparent and cost-effective LLM deployments, particularly for tasks requiring concise, causally linked explanations.
Key insights
AtManRL uses differentiable attention masks and RL to make LLM reasoning traces more salient and efficient.
Principles
- Saliency is a necessary condition for faithfulness in LLM explanations.
- Differentiable attention masks can identify causally influential tokens.
- Jointly optimize for correctness and reasoning saliency via RL.
Method
Initialize an additive attention mask with negative values to suppress CoT token influence, then optimize it to restore correct answer probability. Derive a saliency reward from the optimized mask and integrate it into GRPO-based reinforcement learning.
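The reward-combination step can be sketched as follows, assuming a simple weighted sum of outcome and saliency rewards. The weight `saliency_weight`, the function names, and the example numbers are hypothetical and do not come from the paper. The group-relative normalization (reward minus group mean, divided by group standard deviation) is the standard GRPO advantage; using the population standard deviation here is a simplifying choice.

```python
import statistics

def combined_rewards(outcome_rewards, saliency_rewards, saliency_weight=0.5):
    """Weighted sum of outcome and saliency rewards (the weight is hypothetical)."""
    return [o + saliency_weight * s
            for o, s in zip(outcome_rewards, saliency_rewards)]

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each reward within its sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four hypothetical completions sampled for the same prompt.
outcome = [1.0, 1.0, 0.0, 1.0]    # 1.0 = correct answer, 0.0 = wrong
saliency = [0.9, 0.3, 0.8, 0.6]   # made-up saliency rewards in [0, 1]

rewards = combined_rewards(outcome, saliency)
advantages = group_relative_advantages(rewards)
```

With these made-up numbers, the first completion receives the largest advantage: among the correct completions, it is the one with the highest saliency reward. The incorrect completion ends up with a negative advantage despite its high saliency.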
In practice
- Apply AtManRL to reduce LLM inference costs by shortening reasoning traces.
- Use saliency rewards to prioritize information-dense tokens.
- Train models to generate shorter, more impactful reasoning.
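One way to check whether reasoning traces are becoming more information-dense is to measure their token composition. The sketch below is a rough illustration, not the paper's evaluation code: the regex tokenizer, the tiny `STOP_WORDS` set, and the category names are all assumptions made for the example.

```python
import re

# Tiny illustrative stop-word list (assumption); a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "an", "is", "of", "to", "and", "so", "we", "that", "in", "it"}

def token_composition(text):
    """Return the share of stop words, numbers, symbols, and other words in text."""
    # Numbers (with optional decimals), alphabetic words, or single symbol characters.
    tokens = re.findall(r"\d+(?:\.\d+)?|[A-Za-z]+|[^\sA-Za-z\d]", text)
    counts = {"stop_words": 0, "numbers": 0, "symbols": 0, "other_words": 0}
    for tok in tokens:
        if tok[0].isdigit():
            counts["numbers"] += 1
        elif tok.isalpha():
            key = "stop_words" if tok.lower() in STOP_WORDS else "other_words"
            counts[key] += 1
        else:
            counts["symbols"] += 1
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {k: v / total for k, v in counts.items()}

# Two hypothetical reasoning traces for the same arithmetic step.
verbose_mix = token_composition("So we know that the total is 3 plus 4 and it is 7")
dense_mix = token_composition("3 + 4 = 7")
```

Comparing `verbose_mix` and `dense_mix` shows the kind of shift the summary reports: the terse trace contains no stop words and a much larger share of numbers and symbols.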
Topics
- AtManRL
- Faithful Reasoning
- Differentiable Attention Manipulation
- Reinforcement Learning
- Chain-of-Thought
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.