ReSum: Synergizing LLM Reasoning and Summarization with Reinforcement Learning

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, long

Summary

ReSum is a Reinforcement Learning with Verifiable Rewards (RLVR) framework from the University of Science and Technology of China and AMAP, Alibaba Group, that enhances Large Language Model (LLM) reasoning by integrating self-summarization. Existing RLVR methods often produce excessively long reasoning rollouts, which can reduce coherence and exhaust the context budget. ReSum addresses this by training LLMs to compress and organize their own reasoning trajectories as they generate them. Pilot studies showed that self-summarization stabilizes generation by lowering token-level entropy and significantly reduces error propagation from incorrect rollout prefixes. Motivated by these findings, ReSum employs an adaptive rollout mechanism that contrastively evaluates the utility of self-summarization, using both spontaneous and injected summarization phrases. The framework improves LLM performance by an average of 4% while reducing rollout length by 18.6%.

Key takeaway

Machine learning engineers optimizing long-horizon LLM reasoning should consider integrating self-summarization techniques such as ReSum. The reported gains are a 4% average performance improvement and an 18.6% reduction in rollout length, which mitigates context exhaustion and error propagation. Adaptive rollout mechanisms that use contrastive evaluation to decide where summarization helps are worth exploring as a way to improve coherence and performance.

Key insights

Self-summarization intrinsically stabilizes LLM reasoning, reducing errors and context length in RLVR frameworks.

Principles

Longer LLM reasoning chains can degrade coherence.
Self-summarization lowers token-level entropy.
Summarization mitigates error propagation.
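
To make the entropy principle concrete, the sketch below computes the Shannon entropy of a next-token distribution from raw logits and averages it over a rollout. It is an illustration only: the function names and the toy logits are assumptions, not taken from the paper or its code, and a real measurement would use the model's full-vocabulary logits at each step.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_token_entropy(per_step_logits):
    """Average token-level entropy over all generation steps."""
    entropies = [token_entropy(step) for step in per_step_logits]
    return sum(entropies) / len(entropies)

# Toy 4-token vocabulary (hypothetical values, not measured data).
diffuse_rollout = [[0.0, 0.0, 0.0, 0.0]] * 3  # model is unsure at each step
peaked_rollout = [[8.0, 0.0, 0.0, 0.0]] * 3   # model is confident at each step

print(mean_token_entropy(diffuse_rollout))
print(mean_token_entropy(peaked_rollout))
```

A lower mean value indicates a more peaked, and in that sense more stable, generation. Comparing this statistic for continuations that follow a summary against continuations that do not is one way to check the effect on your own model.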

Method

ReSum uses contrastive branching with Natural Points (masking spontaneous summaries) and Artifact Points (injecting summaries) to train LLMs to self-summarize effectively, guided by a summarization-aware advantage.
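
The branching idea can be sketched roughly as follows. This is not the authors' implementation: the trigger phrase, the `Branch` type, the phrase-stripping stand-in for masking, the `bonus_weight` parameter, and the exact form of the advantage are all assumptions made for illustration. The summary above only states that branches with and without a summary are contrasted and that the advantage is summarization-aware.

```python
from dataclasses import dataclass
from statistics import mean, pstdev

# Hypothetical trigger phrase; the phrases used in the paper are not given here.
SUMMARY_PHRASE = "To summarize so far:"

@dataclass
class Branch:
    text: str
    has_summary: bool
    reward: float  # verifiable reward, e.g. 1.0 if the final answer is correct

def is_natural_point(prefix: str) -> bool:
    """Natural Point: the model emitted the summary phrase on its own."""
    return prefix.rstrip().endswith(SUMMARY_PHRASE)

def make_contrastive_prefixes(prefix: str):
    """Return (prefix_with_summary, prefix_without_summary) for branching."""
    stripped = prefix.rstrip()
    if is_natural_point(prefix):
        # Stand-in for masking: drop the spontaneous phrase to get the contrast branch.
        without = stripped[: -len(SUMMARY_PHRASE)].rstrip()
        return stripped, without
    # Artifact Point: inject the phrase where the model did not produce it.
    return stripped + "\n" + SUMMARY_PHRASE, stripped

def summarization_utility(branches):
    """Mean reward of summary branches minus mean reward of the others."""
    with_s = [b.reward for b in branches if b.has_summary]
    without = [b.reward for b in branches if not b.has_summary]
    return mean(with_s) - mean(without)

def group_advantages(branches, bonus_weight=0.5):
    """Group-normalized advantage, shifted for summary branches (assumed form)."""
    rewards = [b.reward for b in branches]
    mu, sigma = mean(rewards), pstdev(rewards)
    utility = summarization_utility(branches)
    advantages = []
    for b in branches:
        base = (b.reward - mu) / (sigma + 1e-6)
        shift = bonus_weight * utility if b.has_summary else 0.0
        advantages.append(base + shift)
    return advantages

branches = [
    Branch("a", True, 1.0), Branch("b", True, 1.0),
    Branch("c", False, 1.0), Branch("d", False, 0.0),
]
print(summarization_utility(branches))
print(group_advantages(branches))
```

In this toy setup, summary branches receive a positive shift only when summarizing at that point actually raised the average reward, so the policy is not rewarded for summarizing indiscriminately.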

In practice

Implement self-summarization in RLVR.
Use contrastive evaluation to decide when to summarize.
Reduce context length in long-horizon tasks.

Topics

Reinforcement Learning
Large Language Models
Self-Summarization
Long-Horizon Reasoning
Context Management
Policy Optimization

Code references

xuc865/Resum

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.