StoryTailor: A Zero-Shot Pipeline for Action-Rich Multi-Subject Visual Narratives

· Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Gaming & Interactive Media · Depth: Advanced, extended

Summary

StoryTailor is a zero-shot pipeline that generates multi-frame, action-rich visual narratives with consistent subject identities and cross-frame background continuity, and it runs efficiently on a single RTX 4090 (24 GB) GPU. It targets three challenges: faithfulness to the action text, subject identity fidelity, and background continuity. Three core modules address them: Gaussian-Centered Attention (GCA) focuses dynamically on each subject and manages grounding-box overlaps; Action-Boost Singular Value Reweighting (AB-SVR) amplifies action-related directions in the text embeddings; and Selective Forgetting Cache (SFC) retains transferable background cues and builds semantic ties across scenes. In the reported experiments, StoryTailor improves CLIP-T scores by 10–15% over baselines, maintains competitive CLIP-I, and runs inference faster than FluxKontext at matched resolution and steps, delivering expressive interactions and stable, evolving scenes.
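
The idea of centering attention on each subject while handling overlapping grounding boxes can be illustrated with a small sketch. This is not the paper's GCA implementation: the function names, the `sigma_scale` parameter, and the per-pixel normalization are assumptions chosen only to show how Gaussian weighting might soften a hard box boundary where two subjects overlap.

```python
import numpy as np

def gaussian_box_mask(box, height, width, sigma_scale=0.5):
    """Soft mask that peaks at the box centre (hypothetical helper).

    box is (x0, y0, x1, y1) in pixels; sigma_scale is an assumed knob that
    ties the Gaussian spread to the box size.
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    sx = max((x1 - x0) * sigma_scale, 1e-6)
    sy = max((y1 - y0) * sigma_scale, 1e-6)
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-0.5 * (((xs - cx) / sx) ** 2 + ((ys - cy) / sy) ** 2))

def subject_weights(boxes, height, width):
    """Normalise the per-subject masks so they sum to 1 at every pixel.

    In an overlap region each subject receives a share proportional to its
    Gaussian value instead of both subjects claiming the pixel outright.
    """
    masks = np.stack([gaussian_box_mask(b, height, width) for b in boxes])
    return masks / (masks.sum(axis=0, keepdims=True) + 1e-8)

# Two horizontally overlapping boxes on a 64x64 latent grid (made-up values).
weights = subject_weights([(4, 8, 36, 56), (28, 8, 60, 56)], 64, 64)
```

In this toy setup, pixels near a box centre are dominated by that subject, while the column midway between the two centres is split evenly. Such weights could in principle modulate per-subject cross-attention, but how StoryTailor applies them is not specified in this summary.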

Key takeaway

For AI Scientists and Computer Vision Engineers developing narrative image synthesis systems, StoryTailor offers a practical zero-shot approach to generating multi-subject, action-rich visual narratives on consumer-grade GPUs. You should consider adopting its modular design, particularly Gaussian-Centered Attention for subject decoupling, Action-Boost SVR for enhancing action fidelity, and Selective Forgetting Cache for maintaining scene continuity, to improve output quality and computational efficiency in your projects.

Key insights

StoryTailor generates consistent multi-subject visual narratives zero-shot on a single GPU by integrating specialized attention, text embedding reweighting, and selective caching.
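
As a rough illustration of what reweighting singular values of a text embedding could look like, the sketch below decomposes a token-by-dimension embedding matrix with an SVD and scales up the singular directions that action tokens contribute to most. The scoring rule, the `alpha` parameter, and the function name are assumptions for illustration; the summary does not describe how AB-SVR selects or scales directions.

```python
import numpy as np

def action_boost_svr(embeddings, action_idx, alpha=0.5):
    """Amplify singular directions associated with action tokens (a sketch).

    embeddings: (num_tokens, dim) text-embedding matrix.
    action_idx: indices of tokens assumed to describe the action.
    alpha:      assumed boost strength; alpha = 0 returns the input unchanged.
    """
    U, s, Vt = np.linalg.svd(embeddings, full_matrices=False)
    # Share of each singular direction carried by the action tokens.
    # Columns of U have unit norm, so each share lies in [0, 1].
    share = (U[action_idx] ** 2).sum(axis=0)
    s_boosted = s * (1.0 + alpha * share)
    # Rebuild the embedding matrix with the reweighted singular values.
    return (U * s_boosted) @ Vt

rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 16))                        # stand-in for real prompt embeddings
boosted = action_boost_svr(tokens, action_idx=[2, 3])    # tokens 2-3 play the role of the verb
```

Because every singular value is scaled by a factor of at least 1, the reconstruction keeps the same shape and never loses energy; directions with little action-token contribution are left almost untouched.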

Method

StoryTailor integrates Resampler into Stable Diffusion XL for multi-subject conditioning, applies GCA for dynamic subject masking, uses AB-SVR to reweight action-related directions in the text embeddings, and employs SFC for adaptive KV cache management to ensure cross-frame continuity.
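
To make the notion of adaptive KV cache management concrete, here is a minimal sketch of one possible retention rule: keep the cached key/value entries most similar to a vector summarizing the new frame and forget the rest. The cosine-similarity criterion, the `keep_ratio` parameter, and the function name are assumptions; the summary does not state which rule SFC actually uses.

```python
import numpy as np

def selective_forget(cache_keys, cache_values, query, keep_ratio=0.5):
    """Retain the cached entries most similar to the new frame (a sketch).

    cache_keys:   (n, d) keys cached from earlier frames.
    cache_values: (n, d_v) values aligned with cache_keys.
    query:        (d,) vector summarising the new frame (assumed input).
    keep_ratio:   assumed fraction of entries to keep.
    """
    k_norm = cache_keys / (np.linalg.norm(cache_keys, axis=1, keepdims=True) + 1e-8)
    q_norm = query / (np.linalg.norm(query) + 1e-8)
    sim = k_norm @ q_norm                          # cosine similarity per entry
    n_keep = max(1, int(round(keep_ratio * len(cache_keys))))
    keep = np.sort(np.argsort(sim)[-n_keep:])      # top entries, original order
    return cache_keys[keep], cache_values[keep], keep

# Toy cache: two entries aligned with the new frame, two that are not.
keys = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
values = np.arange(4, dtype=float)[:, None]
kept_keys, kept_values, kept_idx = selective_forget(keys, values, np.array([1.0, 0.0]))
```

In the toy example the two entries pointing in roughly the same direction as the query survive, which mirrors the idea of carrying forward only background cues that transfer to the next scene.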

Best for: Computer Vision Engineer, AI Scientist, Research Scientist, AI Researcher, AI Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.