Pseudo-Text-Conditioned 3D Grounding DINO for Organ Localization in Abdominal CT

Source: Computer Vision and Pattern Recognition · Field: Science & Research (Health & Medical Research, Artificial Intelligence & Machine Learning) · Depth: Expert, quick

Summary

CT-3GDINO is a lightweight 3D detector for organ localization in abdominal CT scans, intended to supply spatial priors for downstream trauma analysis. The model adapts a Grounding-DINO-style query-based architecture, replacing the real text encoder with frozen pseudo-text class tokens. It combines a Swin3D visual backbone, bidirectional feature enhancement, pseudo-text-guided query selection, and a cross-modality decoder to predict normalized 3D boxes for five organs: liver, spleen, left kidney, right kidney, and bowel. Trained and evaluated on 193 matched RSNA/RATIC CT volumes, the best multi-scale variant reached 0.5830 overall top-1 class-wise mAP, averaged over 3D IoU thresholds from 0.1 to 0.7. That exceeds the classification-pretrained variants, which scored 0.5570 mAP with a fixed backbone and 0.4657 mAP with a trainable backbone. Coarse localization is strong (0.9649 AP at IoU 0.1), but strict box alignment remains limited (0.1552 AP at IoU 0.7). CT-3GDINO is released as an open-source baseline for pseudo-text-conditioned 3D organ localization.
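To make the gap between the IoU 0.1 and IoU 0.7 results concrete, the following minimal sketch computes 3D IoU for two axis-aligned normalized boxes and lists which thresholds the overlap clears. It is illustrative only and is not taken from the paper's code: the `(cx, cy, cz, w, h, d)` box layout, the 0.1 step between thresholds, and all function names are assumptions made for this example.

```python
from typing import List, Sequence, Tuple

# Assumed box layout (not confirmed by the source): center and size,
# all normalized to [0, 1] along each volume axis.
Box = Tuple[float, float, float, float, float, float]


def corners(box: Box) -> Tuple[Tuple[float, float], ...]:
    """Return the (low, high) extent of the box along each of the three axes."""
    cx, cy, cz, w, h, d = box
    return (
        (cx - w / 2, cx + w / 2),
        (cy - h / 2, cy + h / 2),
        (cz - d / 2, cz + d / 2),
    )


def volume(box: Box) -> float:
    """Volume of an axis-aligned box, clamped at zero for degenerate sizes."""
    vol = 1.0
    for low, high in corners(box):
        vol *= max(0.0, high - low)
    return vol


def iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for (a_low, a_high), (b_low, b_high) in zip(corners(a), corners(b)):
        # Overlap length along this axis; zero if the extents do not meet.
        inter *= max(0.0, min(a_high, b_high) - max(a_low, b_low))
    union = volume(a) + volume(b) - inter
    return inter / union if union > 0 else 0.0


def thresholds_met(
    iou: float,
    thresholds: Sequence[float] = (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7),
) -> List[float]:
    """List the IoU thresholds (assumed step of 0.1) that a prediction clears."""
    return [t for t in thresholds if iou >= t]


# A prediction of the right size, shifted by half its width along one axis.
predicted: Box = (0.5, 0.5, 0.5, 0.2, 0.2, 0.2)
reference: Box = (0.6, 0.5, 0.5, 0.2, 0.2, 0.2)
overlap = iou_3d(predicted, reference)
cleared = thresholds_met(overlap)
```

In this example a box of the correct size that is offset by half its width along a single axis has an IoU of about one third, so it counts as a hit at the loose thresholds and a miss at the strict ones. Small offsets are penalized heavily in 3D because overlap is a product over three axes, which is consistent with the pattern of high AP at IoU 0.1 and low AP at IoU 0.7.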

Key takeaway

For AI scientists developing medical image analysis tools, CT-3GDINO offers a pseudo-text-conditioned baseline for 3D organ localization. Consider this lightweight architecture for generating initial spatial priors in abdominal CT trauma analysis, especially where coarse localization is sufficient. Strict box alignment is still weak (0.1552 AP at IoU 0.7), so applications that need tight boxes should plan to add localization-aware pretraining or richer multimodal conditioning.

Key insights

A lightweight 3D detector adapts Grounding DINO using frozen pseudo-text tokens for organ localization in abdominal CT.

Principles

Method

CT-3GDINO combines a Swin3D visual backbone, bidirectional feature enhancement, pseudo-text-guided query selection, and a cross-modality decoder to predict 3D organ boxes.
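The query selection step can be pictured with a small sketch. It follows the language-guided query selection described for the original Grounding DINO, in which each visual token is scored by its best match against the text tokens and the top-scoring tokens seed the decoder queries. Whether CT-3GDINO uses exactly this rule is not stated in this summary, so treat it as an assumption. The function names, the two-dimensional toy embeddings, and the token values are all hypothetical. In the model the class tokens would be frozen embeddings, one per organ.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def select_queries(
    visual_tokens: Sequence[Vector],
    class_tokens: Sequence[Vector],
    num_queries: int,
) -> List[int]:
    """Return indices of the visual tokens that best match any class token.

    Each visual token is scored by its maximum similarity over the frozen
    pseudo-text class tokens; the highest-scoring tokens are kept.
    """
    scores = [
        max(dot(token, cls) for cls in class_tokens) for token in visual_tokens
    ]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:num_queries]


# Hypothetical toy data: four visual tokens and two frozen class tokens.
toy_visual_tokens = [[0.2, 0.1], [0.9, 0.1], [0.1, 0.8], [0.3, 0.3]]
toy_class_tokens = [[1.0, 0.0], [0.0, 1.0]]
selected = select_queries(toy_visual_tokens, toy_class_tokens, num_queries=2)
```

With these toy values the second and third visual tokens are selected, since each aligns strongly with one of the class tokens. Because the class tokens are frozen, only the visual side has to learn to produce features that match them.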

Best for: AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.