StoryTailor: A Zero-Shot Pipeline for Action-Rich Multi-Subject Visual Narratives

2025-07-20 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Gaming & Interactive Media · Depth: Advanced, extended

Summary

StoryTailor is a zero-shot pipeline that generates multi-frame, action-rich visual narratives with consistent subject identities and cross-frame background continuity, and it runs on a single RTX 4090 (24 GB) GPU. It targets three challenges: faithfulness to the action text, subject identity fidelity, and background continuity. Three core modules address them: Gaussian-Centered Attention (GCA), which focuses dynamically on each subject and manages overlapping grounding boxes; Action-Boost Singular Value Reweighting (AB-SVR), which amplifies action-related directions in text embeddings; and Selective Forgetting Cache (SFC), which retains transferable background cues and builds semantic ties across scenes. In experiments, StoryTailor improves CLIP-T scores by 10–15% over baselines, maintains competitive CLIP-I, and runs inference faster than FluxKontext at matched resolution and steps, producing expressive interactions and stable, evolving scenes.

Key takeaway

For AI Scientists and Computer Vision Engineers developing narrative image synthesis systems, StoryTailor offers a practical zero-shot approach to generating multi-subject, action-rich visual narratives on consumer-grade GPUs. You should consider adopting its modular design, particularly Gaussian-Centered Attention for subject decoupling, Action-Boost SVR for enhancing action fidelity, and Selective Forgetting Cache for maintaining scene continuity, to improve output quality and computational efficiency in your projects.

Key insights

StoryTailor generates consistent multi-subject visual narratives zero-shot on a single GPU by integrating specialized attention, text embedding reweighting, and selective caching.

Principles

Explicitly localize attention on subjects to prevent identity confusion.
Enhance action representation in text embeddings for diverse behaviors.
Propagate background context selectively to maintain scene continuity.
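
To make the first principle concrete, here is a minimal sketch of a Gaussian-weighted subject mask. It is an illustration under stated assumptions, not the paper's implementation: the function names, the `sigma_scale` parameter, and the overlap rule (share a pixel between subjects in proportion to their Gaussian weights) are all hypothetical choices made for this example.

```python
import numpy as np

def gaussian_box_mask(height, width, box, sigma_scale=0.5):
    """Soft mask that peaks at the box centre and decays smoothly outward.

    box is (x0, y0, x1, y1) in grid coordinates.
    sigma_scale is a hypothetical knob: sigma = box size * sigma_scale.
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    sx = max((x1 - x0) * sigma_scale, 1e-6)
    sy = max((y1 - y0) * sigma_scale, 1e-6)
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-0.5 * (((xs - cx) / sx) ** 2 + ((ys - cy) / sy) ** 2))

def resolve_overlaps(masks):
    """Where per-subject masks sum above 1, rescale so the weights sum to 1.

    Assumed rule: overlapping pixels are shared in proportion to each
    subject's Gaussian weight; non-overlapping regions are left unchanged.
    """
    stacked = np.stack(masks)                      # (num_subjects, H, W)
    total = stacked.sum(axis=0, keepdims=True)
    return stacked / np.maximum(total, 1.0)

# Two subjects whose grounding boxes overlap.
mask_a = gaussian_box_mask(16, 16, (2, 2, 10, 10))
mask_b = gaussian_box_mask(16, 16, (6, 6, 14, 14))
weights = resolve_overlaps([mask_a, mask_b])
```

Compared with a hard box mask, the soft falloff means pixels near a box edge contribute less, which is one plausible way to limit background carryover and to decide which subject owns a contested region.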

Method

StoryTailor integrates Resampler into Stable Diffusion XL for multi-subject conditioning, applies GCA for dynamic subject masking, uses AB-SVR to upweight action-related directions in the text embeddings, and employs SFC for adaptive KV cache management to ensure cross-frame continuity.
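
The AB-SVR step can be illustrated with a small singular value reweighting sketch. The summary does not say how action-related directions are chosen, so this example makes a loud assumption: a direction counts as action-related when the action (verb) tokens have large coefficients on it. The function name and the `boost` and `top_k` parameters are hypothetical.

```python
import numpy as np

def action_boost_svr(embeddings, action_token_ids, boost=1.5, top_k=2):
    """Reweight singular values of a (tokens x dim) text embedding matrix.

    Assumption (not from the source): the directions to boost are the
    top_k singular directions on which the action tokens load most.
    """
    u, s, vt = np.linalg.svd(embeddings, full_matrices=False)
    # Coefficient of each action token on each singular direction.
    coeffs = np.abs(u[action_token_ids] * s)       # (num_action, rank)
    scores = coeffs.sum(axis=0)
    boosted = np.argsort(scores)[::-1][:top_k]
    new_s = s.copy()
    new_s[boosted] *= boost
    # Rebuild the embedding matrix with the reweighted spectrum.
    return (u * new_s) @ vt, boosted

rng = np.random.default_rng(0)
text_emb = rng.normal(size=(6, 8))     # stand-in for encoder output
reweighted, picked = action_boost_svr(text_emb, action_token_ids=[2], boost=2.0)
```

With `boost=1.0` the function returns the input unchanged, which makes it easy to compare boosted and unboosted prompts in an ablation.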

In practice

Use Gaussian-Centered Attention to soften box boundaries and reduce background carryover.
Apply Action-Boost SVR to text embeddings to strengthen verb semantics.
Implement Selective Forgetting Cache for controlled background context propagation.
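
As a rough illustration of the last item, the sketch below shows one way a cache could keep some background context and forget the rest. Everything here is hypothetical: the class and field names, the per-entry `score` (standing in for how transferable a frame's background cues are), and the decay, threshold, and capacity rules. The source only states that SFC retains transferable background cues across frames.

```python
from dataclasses import dataclass, field

@dataclass
class CacheEntry:
    frame: int
    key: list
    value: list
    score: float   # hypothetical "transferability" score in [0, 1]

@dataclass
class SelectiveForgettingCache:
    capacity: int = 4
    decay: float = 0.5
    min_score: float = 0.2
    entries: list = field(default_factory=list)

    def update(self, frame, key, value, score):
        # Age existing entries so old cues fade unless they scored highly.
        for entry in self.entries:
            entry.score *= self.decay
        self.entries.append(CacheEntry(frame, key, value, score))
        # Forget entries below the threshold, then cap the cache size.
        kept = [e for e in self.entries if e.score >= self.min_score]
        kept.sort(key=lambda e: e.score, reverse=True)
        self.entries = kept[: self.capacity]

    def context(self):
        """Cached keys and values in frame order, for the next frame."""
        ordered = sorted(self.entries, key=lambda e: e.frame)
        return [e.key for e in ordered], [e.value for e in ordered]

cache = SelectiveForgettingCache(capacity=2)
cache.update(0, [1.0], [10.0], score=0.9)   # strong, reusable background
cache.update(1, [2.0], [20.0], score=0.3)   # weak cue, forgotten after one more frame
cache.update(2, [3.0], [30.0], score=0.8)
keys, values = cache.context()
```

In a diffusion pipeline the keys and values would be attention tensors rather than short lists; plain lists are used here only to keep the example self-contained.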

Topics

Visual Narrative Generation
Diffusion Models
Zero-Shot Learning
Multi-Subject Image Synthesis
Attention Mechanisms

Best for: Computer Vision Engineer, AI Scientist, Research Scientist, AI Researcher, AI Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.