ARTEMIS: Agent-guided Reliability-aware Temporal Mask Evolution for Imperfectly Supervised Video Polyp Segmentation

2026-06-18 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Health & Medical Research · Depth: Expert, quick

Summary

ARTEMIS is a unified framework for imperfectly supervised video polyp segmentation (VPS), a task made difficult by weak contrast, motion blur, and sparse pixel-level guidance. SAM2 can generate initial dense masks from weak annotations (points, scribbles) or from a limited set of labeled frames, but using those masks directly as pseudo-labels often yields geometry-degraded masks and underuses temporal consistency. ARTEMIS addresses both problems in three stages. It first initializes coarse masks, then uses a debate-and-judge vision-language agent to select reliable temporal anchors. These anchors are propagated bidirectionally with SAM2 to refine unreliable or unlabeled frames. Finally, the segmenter is trained with temporal reliability-aware robust learning, which combines reliability-guided reference selection, a Reference Prototype Transport Module, and a reliability-aware robust loss. Experiments on the SUN-SEG and CVC-ClinicDB-612 datasets show that ARTEMIS achieves leading performance across scribble, point, and limited-label settings.

Key takeaway

For computer vision engineers building medical video segmentation with limited labels, ARTEMIS shows how to turn weak annotations into usable training masks: assess mask reliability with an agent, propagate the trusted frames through time, and down-weight what remains uncertain during training. Consider this approach for video polyp segmentation when only points, scribbles, or a few labeled frames are available. The code release is described as upcoming; watch the repository listed under Code references.

Key insights

ARTEMIS improves imperfectly supervised video polyp segmentation by combining agent-guided reliability assessment with temporal mask evolution.

Principles

Reliability assessment improves weak supervision.
Temporal consistency refines sparse labels.
Robust learning down-weights noisy data.
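
To make the temporal-consistency principle concrete, here is a minimal sketch that scores each frame by how well its mask agrees with its neighbours. It is not the paper's reliability measure (ARTEMIS uses a debate-and-judge vision-language agent); mean neighbour IoU is a simple stand-in, and the names mask_iou and temporal_consistency are hypothetical.

```python
from typing import List

Mask = List[List[int]]  # binary mask stored as rows of 0/1 values


def mask_iou(a: Mask, b: Mask) -> float:
    """Intersection-over-union of two same-sized binary masks."""
    inter = union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            inter += 1 if (pa and pb) else 0
            union += 1 if (pa or pb) else 0
    # Two empty masks are treated as perfectly consistent.
    return 1.0 if union == 0 else inter / union


def temporal_consistency(masks: List[Mask]) -> List[float]:
    """Score each frame by its mean IoU with its immediate neighbours.

    Assumption: this is an illustrative proxy, not the ARTEMIS agent.
    """
    scores = []
    for t in range(len(masks)):
        neighbours = [masks[i] for i in (t - 1, t + 1) if 0 <= i < len(masks)]
        if not neighbours:
            scores.append(1.0)  # a single-frame clip has nothing to disagree with
            continue
        total = sum(mask_iou(masks[t], n) for n in neighbours)
        scores.append(total / len(neighbours))
    return scores


masks = [
    [[1, 1], [0, 0]],
    [[1, 1], [0, 0]],
    [[0, 0], [1, 1]],  # abrupt change: likely an unreliable mask
]
scores = temporal_consistency(masks)
```

A frame whose mask jumps abruptly relative to its neighbours receives a low score, which is the kind of signal that could flag it for refinement rather than direct use as a pseudo-label.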

Method

ARTEMIS initializes masks, uses a vision-language agent to select reliable temporal anchors, propagates them bidirectionally with SAM2, and trains with reliability-aware robust learning.
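
The anchor-selection and bidirectional-propagation steps can be sketched as a planning routine. This is a simplified illustration under stated assumptions: the per-frame reliability scores are given as plain floats (in ARTEMIS they come from the vision-language agent), the 0.8 threshold is arbitrary, and no SAM2 call is made. The sketch only records which anchors would feed each non-anchor frame from the past and from the future. The function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def select_anchors(reliability: List[float], threshold: float = 0.8) -> List[int]:
    """Return indices of frames whose reliability meets the threshold.

    Assumption: threshold-based selection stands in for the agent's judgment.
    """
    return [t for t, r in enumerate(reliability) if r >= threshold]


def propagation_plan(
    num_frames: int, anchors: List[int]
) -> Dict[int, Tuple[Optional[int], Optional[int]]]:
    """Map each non-anchor frame to its nearest earlier and later anchor.

    A value of None means no anchor exists in that direction, so the frame
    could only be refined by propagation from the other side.
    """
    anchor_set = set(anchors)
    plan: Dict[int, Tuple[Optional[int], Optional[int]]] = {}
    for t in range(num_frames):
        if t in anchor_set:
            continue  # anchors keep their own masks
        prev_anchor = max((a for a in anchors if a < t), default=None)
        next_anchor = min((a for a in anchors if a > t), default=None)
        plan[t] = (prev_anchor, next_anchor)
    return plan


reliability = [0.9, 0.4, 0.3, 0.85, 0.2]
anchors = select_anchors(reliability)
plan = propagation_plan(len(reliability), anchors)
```

In a real pipeline, each (earlier, later) pair would seed a forward and a backward propagation pass with the video segmentation model, and the two results would then be merged for the frame in question.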

In practice

Apply SAM2 to generate initial coarse masks from weak annotations.
Use agent-guided selection to pick reliable anchor frames.
Implement a reliability-aware robust loss to handle noisy pseudo-labels.
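
As an illustration of the last step, the sketch below shows one common way to down-weight noisy pseudo-labels: a binary cross-entropy in which each pixel's term is scaled by a reliability weight. The source does not give the exact form of the ARTEMIS loss, so this weighted BCE is an assumed stand-in, and weighted_bce is a hypothetical name.

```python
import math
from typing import List


def weighted_bce(
    probs: List[float],
    labels: List[int],
    reliability: List[float],
    eps: float = 1e-7,
) -> float:
    """Binary cross-entropy with per-pixel reliability weights in [0, 1].

    Assumption: a generic reliability-weighted loss, not the paper's formula.
    """
    total = 0.0
    weight_sum = 0.0
    for p, y, w in zip(probs, labels, reliability):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        pixel_loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        total += w * pixel_loss
        weight_sum += w
    # Normalize by total weight so the scale does not depend on how many
    # pixels were trusted; return 0.0 when nothing is trusted.
    return total / weight_sum if weight_sum > 0 else 0.0


probs = [0.9, 0.1]   # model predictions for two pixels
labels = [1, 1]      # pseudo-labels; the second one may be wrong
uniform = weighted_bce(probs, labels, [1.0, 1.0])
down_weighted = weighted_bce(probs, labels, [1.0, 0.0])
```

With uniform weights, the pixel that disagrees with its pseudo-label dominates the loss. Setting its reliability to zero removes that contribution, so a suspect label no longer pulls the model toward it.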

Topics

Video Polyp Segmentation
Imperfect Supervision
Temporal Consistency
Reliability-aware Learning
Vision-Language Agents
SAM2

Code references

wangtong627/ARTEMIS

Best for: AI Scientist, Computer Vision Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.