RARE disease detection from Capsule Endoscopic Videos based on Vision Transformers

2026-03-20 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Medical Imaging Analysis · Depth: Advanced, quick

Summary

A study submitted on March 16, 2026, describes a deep learning approach for multi-label classification of RARE diseases from capsule endoscopic videos (CEV). The researchers fine-tuned a Google Vision Transformer (ViT) at 224x224 pixel resolution. The source names the variant "batch16", which most likely refers to the patch16 checkpoint (16x16 pixel input patches) rather than a training batch size of 16. The model was trained to classify 17 labels, including anatomical locations such as mouth, esophagus, and stomach, and pathological findings such as active bleeding, angiectasia, erosion, polyp, and ulcer. On a test set of three videos, the system achieved an overall mean Average Precision (mAP) of 0.0205 at an Intersection over Union (IoU) threshold of 0.5, and 0.0196 at an IoU threshold of 0.95.
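The summary does not say how IoU applies to a classification task. One plausible reading, which is an assumption and not confirmed by the source, is temporal IoU: per-frame predictions for a label are merged into contiguous segments, and a predicted segment counts as a match when its overlap with a ground-truth segment meets the threshold. A minimal sketch of that idea, with hypothetical helper names:

```python
from typing import List, Optional, Tuple

# Assumption: a segment is an inclusive (start_frame, end_frame) pair.
Segment = Tuple[int, int]


def frames_to_segments(flags: List[bool]) -> List[Segment]:
    """Merge runs of consecutive positive frames for one label into segments."""
    segments: List[Segment] = []
    start: Optional[int] = None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i                      # a new run begins
        elif not flag and start is not None:
            segments.append((start, i - 1))  # the run ended on the previous frame
            start = None
    if start is not None:                  # run reaches the last frame
        segments.append((start, len(flags) - 1))
    return segments


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two inclusive frame ranges."""
    inter = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if inter <= 0:
        return 0.0
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union
```

Under this reading, a threshold of 0.5 accepts a predicted segment that covers roughly half of the combined extent, while 0.95 requires the predicted and ground-truth boundaries to nearly coincide.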

Key takeaway

For computer vision engineers developing diagnostic tools for gastroenterology, this work applies Vision Transformers to identifying multiple rare diseases in capsule endoscopic videos. Fine-tuning a pre-trained ViT is a reasonable starting point for similar multi-label classification tasks in medical imaging, especially when the findings are diverse, but the reported mAP of roughly 0.02 makes this a baseline to improve on rather than a validated detector. Report both mAP@0.5 and mAP@0.95 so that results capture loose and strict matching.
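To make the metric concrete, here is a minimal sketch of non-interpolated average precision for one label, averaged across labels to give mAP. It assumes each prediction has already been marked as matched (1) or unmatched (0) against ground truth at the chosen IoU threshold. The function names are hypothetical, and the source does not state which AP variant it used.

```python
from typing import Dict, List


def average_precision(scores: List[float], matched: List[int], num_gt: int) -> float:
    """Non-interpolated AP for one label.

    scores  : confidence of each prediction
    matched : 1 if the prediction matched a ground-truth item at the
              chosen IoU threshold, else 0
    num_gt  : number of ground-truth items for this label
    """
    if num_gt == 0:
        return 0.0
    # Rank predictions from most to least confident.
    ranked = sorted(zip(scores, matched), key=lambda p: p[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, is_match) in enumerate(ranked, start=1):
        if is_match == 1:
            hits += 1
            precision_sum += hits / rank   # precision at this recall step
    return precision_sum / num_gt


def mean_average_precision(ap_per_label: Dict[str, float]) -> float:
    """Unweighted mean of per-label AP values."""
    if not ap_per_label:
        return 0.0
    return sum(ap_per_label.values()) / len(ap_per_label)
```

Because every label contributes equally to the mean, labels with few or no correct detections pull mAP down sharply, which matters for rare findings.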

Key insights

Vision Transformers can be fine-tuned for multi-label classification of gastrointestinal diseases from capsule endoscopic videos.

Principles

ViT models are adaptable for medical image analysis.
Multi-label classification addresses diverse pathologies.

Method

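The method fine-tunes a Google Vision Transformer (ViT) at 224x224 resolution on capsule endoscopic videos to classify 17 gastrointestinal labels. The source gives the model variant as "batch16", most likely the patch16 checkpoint, which splits each image into 16x16 pixel patches.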
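In a multi-label setup, each of the 17 labels gets its own sigmoid output instead of sharing a softmax, so one frame can carry both an anatomical label and a finding (for example, stomach and ulcer). The sketch below shows that output stage in plain Python. The binary cross-entropy loss and the 0.5 decision threshold are assumptions, since the summary does not state the loss or cutoff the authors used.

```python
import math
from typing import List

NUM_LABELS = 17  # label count reported in the source


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def bce_loss(logits: List[float], targets: List[int]) -> float:
    """Mean binary cross-entropy over independent labels (assumed loss)."""
    eps = 1e-12
    total = 0.0
    for z, t in zip(logits, targets):
        p = min(max(sigmoid(z), eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(logits)


def predict_labels(logits: List[float], threshold: float = 0.5) -> List[int]:
    """Indices of labels whose probability meets the (assumed) threshold."""
    return [i for i, z in enumerate(logits) if sigmoid(z) >= threshold]
```

With all 17 logits at zero, every label has probability 0.5 and the loss equals ln 2 regardless of the targets, which is a handy sanity check at initialization.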

In practice

Apply ViT for medical video analysis.
Consider multi-label classification for complex diagnoses.

Topics

Vision Transformers
Capsule Endoscopy
Medical Image Classification
Multi-label Classification
Gastrointestinal Disease Detection

Best for: Computer Vision Engineer, Research Scientist, AI Researcher, AI Scientist, Deep Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.