StoryTailor: A Zero-Shot Pipeline for Action-Rich Multi-Subject Visual Narratives
Summary
StoryTailor is a zero-shot pipeline that generates multi-frame, action-rich visual narratives with consistent subject identities and cross-frame background continuity, and it runs on a single RTX 4090 (24 GB) GPU. It targets three challenges: faithfulness to the action text, subject identity fidelity, and background continuity. Three core modules address them. Gaussian-Centered Attention (GCA) focuses attention dynamically on each subject and handles overlaps between grounding boxes. Action-Boost Singular Value Reweighting (AB-SVR) amplifies action-related directions in the text embeddings. Selective Forgetting Cache (SFC) retains transferable background cues and builds semantic ties across scenes. In the reported experiments, StoryTailor improves CLIP-T scores by 10–15% over baselines, maintains competitive CLIP-I, and runs inference faster than FluxKontext at matched resolution and steps, producing expressive interactions and stable, evolving scenes.
Key takeaway
For AI Scientists and Computer Vision Engineers developing narrative image synthesis systems, StoryTailor offers a practical zero-shot approach to generating multi-subject, action-rich visual narratives on consumer-grade GPUs. You should consider adopting its modular design, particularly Gaussian-Centered Attention for subject decoupling, Action-Boost SVR for enhancing action fidelity, and Selective Forgetting Cache for maintaining scene continuity, to improve output quality and computational efficiency in your projects.
Key insights
StoryTailor generates consistent multi-subject visual narratives zero-shot on a single GPU by integrating specialized attention, text embedding reweighting, and selective caching.
Principles
- Explicitly localize attention on subjects to prevent identity confusion.
- Enhance action representation in text embeddings for diverse behaviors.
- Propagate background context selectively to maintain scene continuity.
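The first principle, localizing attention on subjects, can be illustrated with a small NumPy sketch. This summary does not give the exact GCA formulation, so everything here is an assumption made for illustration: the function names gaussian_box_mask and resolve_overlap, the sigma_scale parameter, and the choice to normalize overlapping masks so they share weight.

```python
import numpy as np

def gaussian_box_mask(height, width, box, sigma_scale=0.5):
    """Soft mask that peaks at the centre of a grounding box.

    box is (x0, y0, x1, y1) in pixels. sigma_scale is a hypothetical knob
    that ties the Gaussian spread to the box size.
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    sx = max((x1 - x0) * sigma_scale, 1e-6)
    sy = max((y1 - y0) * sigma_scale, 1e-6)
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-0.5 * (((xs - cx) / sx) ** 2 + ((ys - cy) / sy) ** 2))

def resolve_overlap(masks):
    """Rescale stacked subject masks so no pixel carries total weight above 1.

    Assumed overlap rule: where masks overlap strongly, each subject keeps a
    share proportional to its own mask value.
    """
    stacked = np.stack(masks)                      # (num_subjects, H, W)
    total = stacked.sum(axis=0, keepdims=True)
    return stacked / np.maximum(total, 1.0)

# Two overlapping subject boxes on a 64x64 attention map.
mask_a = gaussian_box_mask(64, 64, (8, 8, 40, 40))
mask_b = gaussian_box_mask(64, 64, (24, 24, 56, 56))
weights = resolve_overlap([mask_a, mask_b])
```

Compared with a hard box mask, the weight falls off smoothly toward the box edge, which is one way to read the claim that GCA softens box boundaries.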
Method
StoryTailor integrates Resampler into Stable Diffusion XL for multi-subject conditioning, applies GCA for dynamic subject masking, uses AB-SVR to weight action-related text embeddings, and employs SFC for adaptive KV cache management to ensure cross-frame continuity.
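To make the AB-SVR step more concrete, here is a minimal sketch of singular value reweighting on a token-by-dimension text embedding matrix. It is not the authors' implementation: the scoring rule (the share of each singular direction's energy carried by the action tokens), the alpha gain, and the function name action_boost_svr are all assumptions chosen to show the general idea.

```python
import numpy as np

def action_boost_svr(embeddings, action_idx, alpha=0.5):
    """Boost singular directions that the action tokens rely on.

    embeddings: (num_tokens, dim) text embedding matrix.
    action_idx: indices of tokens treated as action words (assumed given).
    alpha: hypothetical gain; alpha=0 returns the input unchanged.
    """
    u, s, vt = np.linalg.svd(embeddings, full_matrices=False)
    # Columns of u have unit norm, so this share lies in [0, 1] and measures
    # how much of each direction's energy sits on the action tokens.
    share = (u[action_idx, :] ** 2).sum(axis=0)
    boosted = s * (1.0 + alpha * share)
    return (u * boosted) @ vt

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 16))      # stand-in for real text embeddings
reweighted = action_boost_svr(tokens, action_idx=[2], alpha=0.5)
```

Because every singular value is scaled by a factor of at least 1, the action token's embedding norm can only grow under this rule; a real system would probably need to cap or renormalize the result.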
In practice
- Use Gaussian-Centered Attention to soften box boundaries and reduce background carryover.
- Apply Action-Boost SVR to text embeddings to strengthen verb semantics.
- Implement Selective Forgetting Cache for controlled background context propagation.
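The last item, controlled propagation of background context, could look roughly like the cache below. The summary does not describe SFC's actual retention rule, so this sketch assumes one: each cached key/value pair gets a relevance score (cosine similarity to a scene descriptor, clipped at zero), older scores decay per frame, and only the top entries up to a fixed capacity survive. The class name SelectiveCache and the parameters capacity, decay, and scene_vec are hypothetical.

```python
import numpy as np

class SelectiveCache:
    """Toy key/value cache that forgets low-relevance entries (assumed rule)."""

    def __init__(self, capacity, decay=0.8):
        self.capacity = capacity      # maximum retained entries, must be > 0
        self.decay = decay            # per-frame score decay for old entries
        self.keys = None
        self.values = None
        self.scores = None

    def update(self, keys, values, scene_vec):
        """Add one frame's tokens, then drop the lowest-scoring entries."""
        norms = np.linalg.norm(keys, axis=1) * np.linalg.norm(scene_vec) + 1e-8
        new_scores = np.clip(keys @ scene_vec / norms, 0.0, None)
        if self.keys is None:
            self.keys, self.values, self.scores = keys, values, new_scores
        else:
            self.keys = np.concatenate([self.keys, keys])
            self.values = np.concatenate([self.values, values])
            self.scores = np.concatenate([self.scores * self.decay, new_scores])
        # Keep the highest-scoring entries, preserving insertion order.
        keep = np.sort(np.argsort(self.scores)[-self.capacity:])
        self.keys = self.keys[keep]
        self.values = self.values[keep]
        self.scores = self.scores[keep]

cache = SelectiveCache(capacity=4)
scene = np.array([1.0, 0.0])                       # stand-in scene descriptor
frame1 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
frame2 = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 0.1]])
cache.update(frame1, frame1 * 10.0, scene)
cache.update(frame2, frame2 * 10.0, scene)
```

After the second update, the two entries orthogonal to the scene descriptor have been forgotten, while aligned entries from both frames remain available to later frames.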
Topics
- Visual Narrative Generation
- Diffusion Models
- Zero-Shot Learning
- Multi-Subject Image Synthesis
- Attention Mechanisms
Best for: Computer Vision Engineer, AI Scientist, Research Scientist, AI Researcher, AI Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.