Retell, Reward, Repeat: Reinforcement Learning for Narrative Theory-Informed Story Retelling
Summary
Researchers at UNSW explored direct reinforcement learning from AI feedback (d-RLAIF) as a post-training method for Automatic Story Generation (ASG), contrasting it with supervised fine-tuning (SFT). They applied Todorov's Theory of Narrative Equilibrium to define desirable story qualities, then used these principles to prompt 8B and 14B LLM-as-judge models (Selene-1-mini-8B, M-Prometheus-14B) for reward signals. Three open-weight LLMs (Llama-3.1-8B, Olmo-3-7B, Qwen-3-8B) were post-trained using d-RLAIF with GRPO and LoRA on the TimeTravel dataset. Evaluation with Gemini-3-Flash showed that d-RLAIF produced more diverse stories aligned with human narrative conventions, and that it outperformed SFT in overall quality (minLRC) when trained with a narrativity-based reward signal (RN). SFT, however, produced retellings that were linguistically closer to the original stories and more structurally complete. The study highlights d-RLAIF's promise for linguistically grounded ASG.
Key takeaway
AI scientists and machine learning engineers developing Automatic Story Generation systems should consider combining direct reinforcement learning from AI feedback (d-RLAIF) with narrative theory-informed reward models. This approach, particularly with narrativity-based reward signals, can yield more diverse and human-aligned stories than traditional supervised fine-tuning. Design the LLM-as-judge prompts and reward structures carefully: their characteristics significantly influence model convergence and output quality, even with smaller 8B models.
Key insights
Reinforcement learning with narrative theory-informed AI feedback improves story diversity and human narrative alignment.
Principles
- Narratives follow a 5-stage equilibrium structure (Todorov's theory).
- Story quality requires logical, rational, and narratively complete elements.
- LLM-as-judge "harshness" impacts d-RLAIF training dynamics.
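The five-stage structure named above can be written down as a small data type. The stage names below follow the common textbook statement of Todorov's theory (equilibrium, disruption, recognition, attempted repair, new equilibrium); the paper's exact wording may differ, and the `is_narratively_complete` helper is a hypothetical illustration rather than the study's completeness metric.

```python
from enum import IntEnum


class TodorovStage(IntEnum):
    """Todorov's five narrative stages, in canonical order (textbook naming)."""
    EQUILIBRIUM = 1
    DISRUPTION = 2
    RECOGNITION = 3
    REPAIR_ATTEMPT = 4
    NEW_EQUILIBRIUM = 5


def is_narratively_complete(stages):
    """Return True if the labeled stages cover all five in non-decreasing order.

    `stages` is a list of TodorovStage labels, one per story segment.
    Hypothetical check for illustration only.
    """
    in_order = all(a <= b for a, b in zip(stages, stages[1:]))
    return in_order and set(stages) == set(TodorovStage)
```

For example, a story whose segments are labeled with all five stages in sequence passes the check, while one that jumps straight from equilibrium to new equilibrium does not.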
Method
Post-train LLMs using d-RLAIF, where an LLM-as-judge, prompted with narrative theory principles, generates reward signals for GRPO optimization.
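The step from judge scores to a training signal can be sketched as group-relative advantage normalization, the core idea of GRPO: several retellings are sampled for the same prompt, each is scored by the judge, and each score is standardized against its own group. This is a minimal sketch; the function name, the epsilon value, and the use of the population standard deviation are assumptions, and the paper's exact formulation may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize judge rewards within one group of sampled retellings.

    rewards: list of float scores, one per retelling of the same prompt.
    eps: small constant (assumed value) that avoids division by zero
         when every retelling receives the same score.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three retellings of one story, scored by a judge (illustrative numbers).
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

When every retelling in a group receives the same score, as could happen with a very harsh or very lenient judge, all advantages are zero and that group contributes no learning signal. This is one plausible way judge behavior could shape training dynamics, not a finding taken from the paper.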
In practice
- Use Todorov's 5-stage theory to define ASG evaluation criteria.
- Employ d-RLAIF with narrativity-based rewards for diverse story generation.
- Consider LLM-as-judge characteristics beyond accuracy for reward modeling.
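The practices above can be illustrated with a small judge-prompt builder and score parser. Everything here is hypothetical: the prompt wording, the 1 to 5 scale, and the `Score: N` output format are assumptions made for the sketch, and the judge models named in the summary define their own prompt and output formats.

```python
import re

# Textbook naming of Todorov's stages; the paper's wording may differ.
STAGES = [
    "equilibrium",
    "disruption",
    "recognition of the disruption",
    "attempt to repair the disruption",
    "new equilibrium",
]


def build_judge_prompt(original, retelling):
    """Build a hypothetical narrativity-focused prompt for an LLM judge."""
    stage_list = "\n".join(f"{i}. {s}" for i, s in enumerate(STAGES, start=1))
    return (
        "Rate how well the retelling follows these narrative stages:\n"
        f"{stage_list}\n\n"
        f"Original story:\n{original}\n\n"
        f"Retelling:\n{retelling}\n\n"
        "Explain briefly, then end with 'Score: N' where N is 1 to 5."
    )


def parse_score(judge_output, lo=1, hi=5):
    """Extract 'Score: N' and rescale it to [0, 1]; None if missing or invalid."""
    match = re.search(r"Score:\s*(\d+)", judge_output)
    if match is None:
        return None
    score = int(match.group(1))
    if not lo <= score <= hi:
        return None
    return (score - lo) / (hi - lo)
```

Returning `None` for unparseable output keeps malformed judge responses from silently becoming rewards; how such cases are handled during training (skipped, penalized, or retried) is a design choice this sketch leaves open.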
Topics
- Automatic Story Generation
- Reinforcement Learning from AI Feedback
- Narrative Theory
- LLM-as-Judge
- Todorov's Theory of Narrative Equilibrium
- Large Language Models
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.