SA-VIS: Sparse frame Annotations for training Video Instance Segmentation

2026-06-18 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

SA-VIS (Sparse frame Annotations for training Video Instance Segmentation) is a method for reducing the annotation and compute costs of training online video instance segmentation (VIS) models. Online VIS approaches outperform single-image models but typically require long sequences of densely annotated frames. SA-VIS shows that modeling instances and how they evolve across a video does not require dense annotations. It introduces a simple, low-compute Past-frames Feature Propagation (PFP) module that aggregates low-dimensional features from multiple frames, and pairs it with lightweight frame-specific Instance Queries. This design can be trained end to end with sparse video frame labels and significantly improves on its baseline. Trained with annotations for just 1/5 of the dataset images, SA-VIS achieves nearly equivalent accuracy, with only a 0.4% performance drop. It reports strong improvements on YouTube-VIS 2019/2021/2022 and Occluded VIS (OVIS), and gains of over 1% AP over state-of-the-art methods in limited-annotation scenarios.

Key takeaway

For machine learning engineers building video instance segmentation models under tight annotation or compute budgets, SA-VIS is worth evaluating. It reports nearly equivalent accuracy (a 0.4% drop) with annotations for only 1/5 of the dataset images, which can cut labeling effort substantially and shorten model development. Consider integrating the Past-frames Feature Propagation (PFP) module into your VIS architectures.

Key insights

SA-VIS enables high-performance video instance segmentation with significantly fewer annotations by propagating past-frame features.

Principles

Dense video annotations are not essential for effective VIS training.
Low-dimensional feature aggregation across frames is highly effective.
Simple designs can bridge accuracy gaps in sparse annotation scenarios.

Method

SA-VIS employs a Past-frames Feature Propagation (PFP) module to aggregate low-dimensional features from multiple frames, combined with light-weight frame-specific Instance Queries for end-to-end training with sparse labels.
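As a rough illustration of the propagation idea, the sketch below projects each frame's features to a low-dimensional space, keeps a short buffer of past frames, and fuses the averaged past features with the current frame. It is not the paper's implementation: the class name PastFramesPropagation, the fixed random linear projection, the mean aggregation, and the concatenation-based fusion are all assumptions chosen for brevity, and the Instance Queries are left out entirely.

```python
from collections import deque

import numpy as np


class PastFramesPropagation:
    """Hypothetical sketch of past-frame feature propagation (not the paper's code)."""

    def __init__(self, in_dim, low_dim, num_past, seed=0):
        rng = np.random.default_rng(seed)
        # Assumption: a fixed linear projection stands in for a learned
        # channel-reduction layer (for example a 1x1 convolution).
        self.proj = rng.standard_normal((in_dim, low_dim)) / np.sqrt(in_dim)
        # Only low-dimensional features of the last `num_past` frames are stored,
        # which keeps memory and compute small.
        self.past = deque(maxlen=num_past)

    def step(self, feat):
        """feat: array of shape (H, W, in_dim) for the current frame."""
        low = feat @ self.proj  # (H, W, low_dim)
        if self.past:
            # Assumption: past frames are aggregated with a plain mean.
            context = np.mean(np.stack(list(self.past)), axis=0)
        else:
            # First frame of a video: no past context is available yet.
            context = np.zeros_like(low)
        self.past.append(low)
        # Assumption: fusion by channel concatenation, giving (H, W, in_dim + low_dim).
        return np.concatenate([feat, context], axis=-1)
```

Calling step once per frame, in video order, gives each frame access to a compact summary of the frames before it. A learned projection and a learned aggregation would replace the fixed choices made here.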

In practice

Train VIS models with 1/5 of original dense annotations.
Apply PFP module for efficient feature aggregation.
Improve AP by over 1% over state-of-the-art methods in limited annotation settings.
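To make the 1/5 annotation setting concrete, here is a minimal sketch in which only some frames of a clip carry labels and the training loss is averaged over those frames alone. The uniform stride, the function names sparse_label_mask and masked_mean_loss, and the per-frame loss values are assumptions for illustration; the source does not say how the labeled frames are chosen.

```python
def sparse_label_mask(num_frames, keep_every=5, offset=0):
    """Mark which frames are treated as annotated.

    Assumption: labels are kept at a uniform stride (every `keep_every`-th frame).
    """
    if keep_every < 1:
        raise ValueError("keep_every must be >= 1")
    return [(i - offset) % keep_every == 0 for i in range(num_frames)]


def masked_mean_loss(per_frame_losses, labeled):
    """Average the loss over labeled frames only; unlabeled frames add no supervision."""
    picked = [loss for loss, is_labeled in zip(per_frame_losses, labeled) if is_labeled]
    if not picked:
        return 0.0  # a clip with no labeled frame contributes nothing
    return sum(picked) / len(picked)


# Usage: a 10-frame clip in which only frames 0 and 5 are annotated.
mask = sparse_label_mask(10, keep_every=5)
clip_loss = masked_mean_loss([0.5, 9, 9, 9, 9, 1.5, 9, 9, 9, 9], mask)
```

Unlabeled frames still pass through the model, so their features can feed temporal modules such as PFP even though they add no loss term.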

Topics

Video Instance Segmentation
Sparse Annotation
Feature Propagation
Deep Learning Optimization
YouTube-VIS Dataset
Occluded VIS Dataset

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.