Retell, Reward, Repeat: Reinforcement Learning for Narrative Theory-Informed Story Retelling

2025-12-27 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, extended

Summary

Researchers at UNSW explored direct reinforcement learning from AI feedback (d-RLAIF) as a post-training method for Automatic Story Generation (ASG), contrasting it with supervised fine-tuning (SFT). They applied Todorov's Theory of Narrative Equilibrium to define desirable story qualities and used these principles to prompt two LLM-as-judge models (Selene-1-mini-8B, M-Prometheus-14B) for reward signals. Three open-weight LLMs (Llama-3.1-8B, Olmo-3-7B, Qwen-3-8B) were then post-trained using d-RLAIF with GRPO and LoRA on the TimeTravel dataset. Evaluation with Gemini-3-Flash showed that d-RLAIF produced more diverse stories aligned with human narrative conventions and outperformed SFT in overall quality (minLRC) when using a narrativity-based reward signal (RN). SFT, however, yielded higher linguistic similarity to the original stories and greater structural completeness. The study highlights d-RLAIF's promise for linguistically grounded ASG.

Key takeaway

AI scientists and machine learning engineers developing Automatic Story Generation systems should consider combining direct reinforcement learning from AI feedback (d-RLAIF) with narrative theory-informed reward models. This approach, particularly with narrativity-based reward signals, can yield more diverse and human-aligned stories than supervised fine-tuning. Design the LLM-as-judge prompts and reward structures carefully: their characteristics significantly influence model convergence and output quality, even with smaller 8B models.

Key insights

Reinforcement learning with narrative theory-informed AI feedback improves story diversity and human narrative alignment.

Principles

Narratives follow a 5-stage equilibrium structure (Todorov's theory).
Story quality requires stories that are logical, rational, and narratively complete.
LLM-as-judge "harshness" impacts d-RLAIF training dynamics.

Method

Post-train LLMs using d-RLAIF, where an LLM-as-judge, prompted with narrative theory principles, generates reward signals for GRPO optimization.
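The reward-to-update step of this method can be illustrated with a minimal sketch of GRPO's group-relative advantage: several completions are sampled for one prompt, a judge scores each, and every score is normalized against its own group. The function names, the `judge` callable, and the use of population standard deviation are assumptions for illustration; the summary does not specify these details, and real implementations differ on them.

```python
from statistics import mean, pstdev
from typing import Callable, List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one sampled group (GRPO-style).

    Each advantage is (reward - group mean) / (group std + eps).
    Population std is an assumption; some implementations use sample std.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def score_group(
    prompt: str,
    completions: List[str],
    judge: Callable[[str, str], float],
) -> List[float]:
    """Score every completion with a judge, then return group-relative advantages.

    `judge` is a hypothetical stand-in for an LLM-as-judge call that
    returns one scalar reward per (prompt, completion) pair.
    """
    rewards = [judge(prompt, completion) for completion in completions]
    return group_relative_advantages(rewards)


def toy_judge(prompt: str, completion: str) -> float:
    """Placeholder reward (word count) so the sketch runs without a model."""
    return float(len(completion.split()))


advantages = score_group("Retell the story.", ["a", "a b", "a b c"], toy_judge)
```

Note that when a judge assigns the same score to every completion in a group, all advantages in this sketch are zero and the group contributes no learning signal, which is one way a judge's scoring behavior could plausibly affect training.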

In practice

Use Todorov's 5-stage theory to define ASG evaluation criteria.
Employ d-RLAIF with narrativity-based rewards for diverse story generation.
Consider LLM-as-judge characteristics beyond accuracy for reward modeling.
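As a rough illustration of the first point above, the sketch below turns Todorov's five stages into a judge prompt and parses a scalar reward from the judge's reply. The stage descriptions follow the usual statement of the theory, but the prompt wording, the 1 to 5 scale, and the "Score: N" reply format are all assumptions; they are not the prompt or rubric used in the paper.

```python
import re
from typing import Optional

# Todorov's five stages; the short glosses are paraphrases for illustration.
TODOROV_STAGES = [
    ("Equilibrium", "an initial stable state is established"),
    ("Disruption", "an event disturbs that state"),
    ("Recognition", "the disruption is recognized"),
    ("Repair", "an attempt is made to repair the disruption"),
    ("New equilibrium", "a new stable state is reached"),
]


def build_judge_prompt(story: str, max_score: int = 5) -> str:
    """Build a hypothetical rubric prompt for an LLM-as-judge."""
    criteria = "\n".join(
        f"{i}. {name}: {gloss}."
        for i, (name, gloss) in enumerate(TODOROV_STAGES, start=1)
    )
    return (
        "Assess how well the story realizes each narrative stage:\n"
        f"{criteria}\n\n"
        f"Story:\n{story}\n\n"
        f"Reply with brief reasoning, then 'Score: N' where N is 1 to {max_score}."
    )


def parse_score(judge_output: str, max_score: int = 5) -> Optional[float]:
    """Extract 'Score: N' and rescale it to [0, 1]; return None if invalid."""
    match = re.search(r"Score:\s*(\d+)", judge_output)
    if match is None:
        return None
    score = int(match.group(1))
    if not 1 <= score <= max_score:
        return None
    return (score - 1) / (max_score - 1)


prompt = build_judge_prompt("Ana missed her train, so she walked home.")
reward = parse_score("The repair stage is thin. Score: 4")
```

Returning `None` for unparseable or out-of-range replies leaves it to the caller to decide how to treat judge failures (for example, dropping the sample or assigning a default reward), a choice the summary does not describe.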

Topics

Automatic Story Generation
Reinforcement Learning from AI Feedback
Narrative Theory
LLM-as-Judge
Todorov's Theory of Narrative Equilibrium
Large Language Models

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.