LASA: A Weak Supervision Method for Open-Vocabulary Scene Sketch Semantic Segmentation

2026-06-10 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

LASA is a weakly supervised method for open-vocabulary scene sketch semantic segmentation: assigning dense semantic labels to sparse line drawings, with a category vocabulary that can be chosen at inference time and no pixel-level training annotations. Because sketches lack texture and color, semantic understanding depends on stroke layout, and features taken from a single vision-language layer are unstable. LASA builds on the observation that different Vision Transformer layers encode complementary spatial cues: shallow layers capture global structural layouts, while deeper layers focus on local stroke intersections. It aggregates multi-layer attention to guide hierarchical semantic alignment under weak supervision and to refine predictions at inference. In experiments, LASA improves mIoU over prior weakly supervised baselines by 3.43 on FS-COCO, 8.01 on SFSD, and 15.74 on FrISS.

Key takeaway

Computer vision engineers building semantic segmentation for sparse line drawings should expect single-layer vision-language features to be unstable, because sketches carry no texture cues. Multi-layer attention aggregation, as demonstrated by LASA, is worth exploring: it exploits the complementary spatial cues held in different Vision Transformer layers. LASA reports that this structure-aware approach improves segmentation accuracy and spatial coherence, with mIoU gains of 3.43 to 15.74 over prior weakly supervised baselines on sketch datasets.

Key insights

Cross-layer attention aggregation provides robust structural priors for open-vocabulary sketch semantic segmentation.

Principles

Vision Transformer layers encode complementary spatial cues.
Shallow layers capture global structural layouts.
Deeper layers focus on local stroke intersections and object parts.

Method

The LASA framework aggregates multi-layer attention from Vision Transformers to guide hierarchical semantic alignment under weak supervision and refine inference predictions for sketch segmentation.
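The idea of aggregating attention across layers and using it to refine predictions can be sketched in a few lines. This is a minimal illustration under stated assumptions, not LASA's actual formulation: the uniform layer weights, the row renormalization, the single propagation step, and the names aggregate_attention, refine_scores, and predict are all hypothetical. Each layer is assumed to supply a patch-to-patch attention matrix, and each patch is assumed to already have per-class scores from a vision-language model.

```python
def aggregate_attention(layer_attn, weights=None):
    """Weighted average of per-layer attention maps, with rows renormalized to sum to 1.

    layer_attn: list of N x N matrices (lists of lists), one per ViT layer.
    weights: optional per-layer weights; uniform if omitted (an assumption).
    """
    num_layers = len(layer_attn)
    n = len(layer_attn[0])
    if weights is None:
        weights = [1.0] * num_layers
    total = sum(weights)
    agg = [[0.0] * n for _ in range(n)]
    for w, attn in zip(weights, layer_attn):
        for i in range(n):
            for j in range(n):
                agg[i][j] += (w / total) * attn[i][j]
    for i in range(n):
        row_sum = sum(agg[i])
        if row_sum > 0:
            agg[i] = [v / row_sum for v in agg[i]]
    return agg


def refine_scores(agg_attn, patch_scores):
    """Propagate per-patch class scores along the aggregated attention (A @ S)."""
    n = len(agg_attn)
    num_classes = len(patch_scores[0])
    return [
        [sum(agg_attn[i][j] * patch_scores[j][k] for j in range(n))
         for k in range(num_classes)]
        for i in range(n)
    ]


def predict(scores):
    """Index of the highest-scoring class for each patch."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Toy input: 3 patches, 2 classes. Patches 0 and 1 attend to each other in
# both layers; patch 1 has an ambiguous initial score.
shallow = [[0.5, 0.4, 0.1], [0.4, 0.5, 0.1], [0.1, 0.1, 0.8]]
deep = [[0.6, 0.4, 0.0], [0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]
patch_scores = [[0.9, 0.1], [0.45, 0.55], [0.1, 0.9]]

agg = aggregate_attention([shallow, deep])
refined = refine_scores(agg, patch_scores)
print(predict(patch_scores), predict(refined))
```

In this toy input, patch 1 changes label after refinement because the patch it attends to most strongly, patch 0, is confidently assigned to class 0. That is the kind of spatial coherence a structural prior is meant to supply.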

In practice

Use multi-layer attention rather than single-layer features for sketch semantic segmentation.
Reported mIoU gains over prior weakly supervised baselines: +3.43 on FS-COCO, +8.01 on SFSD, and +15.74 on FrISS.
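For readers reproducing such comparisons, mIoU (mean intersection over union) is the per-class IoU averaged over the classes that occur. The sketch below is a generic reference computation, not the evaluation code behind the reported numbers. The function name mean_iou, the ignore_index parameter, and the 0-100 scale are assumptions, and the summary does not say whether evaluation is restricted to stroke pixels.

```python
def mean_iou(pred, target, num_classes, ignore_index=None):
    """Mean IoU (0-100 scale) over classes that appear in pred or target.

    pred, target: flat sequences of integer class labels in [0, num_classes).
    ignore_index: target label to skip, e.g. unlabeled pixels (an assumption).
    """
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            # A mismatch enlarges the union of both the predicted and true class.
            union[t] += 1
            union[p] += 1
    ious = [inter[c] / union[c] for c in range(num_classes) if union[c] > 0]
    return 100.0 * sum(ious) / len(ious) if ious else float("nan")


pred = [0, 0, 1, 1, 2, 2]
target = [0, 0, 1, 2, 2, 2]
print(round(mean_iou(pred, target, num_classes=3), 2))  # 72.22
```

Here the per-class IoUs are 1.0, 0.5, and 2/3, which average to about 72.22. Under this convention, a reported gain of +3.43 is a difference in mIoU points rather than a relative percentage.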

Topics

Open-Vocabulary Segmentation
Sketch Semantic Segmentation
Weak Supervision
Vision Transformers
Attention Mechanisms
Computer Vision

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.