Pseudo-Text-Conditioned 3D Grounding DINO for Organ Localization in Abdominal CT
Summary
CT-3GDINO is a lightweight 3D detector that localizes organs in abdominal CT scans and supplies spatial priors for downstream trauma analysis. It adapts a Grounding-DINO-style query-based architecture but replaces the real text encoder with frozen pseudo-text class tokens. The model combines a Swin3D visual backbone, bidirectional feature enhancement, pseudo-text-guided query selection, and a cross-modality decoder to predict normalized 3D boxes for the liver, spleen, left kidney, right kidney, and bowel. Trained and evaluated on 193 matched RSNA/RATIC CT volumes, the best multi-scale variant achieved 0.5830 overall top-1 class-wise mAP across 3D IoU thresholds from 0.1 to 0.7, ahead of the classification-pretrained variants at 0.5570 mAP (fixed backbone) and 0.4657 mAP (trainable backbone). Coarse localization is strong (0.9649 AP at IoU 0.1), but strict box alignment remains limited (0.1552 AP at IoU 0.7). CT-3GDINO serves as an open-source baseline for pseudo-text-conditioned 3D organ localization.
Key takeaway
For AI scientists building medical image analysis tools, CT-3GDINO offers a pseudo-text-conditioned baseline for 3D organ localization. Consider this lightweight architecture for generating initial spatial priors in abdominal CT trauma analysis, especially where coarse localization is sufficient. Its strict box alignment is still weak (0.1552 AP at IoU 0.7), so plan to add localization-aware pretraining or richer multimodal conditioning where precision is critical.
Key insights
A lightweight 3D detector adapts Grounding DINO using frozen pseudo-text tokens for organ localization in abdominal CT.
Principles
- Pseudo-text tokens can guide query-based 3D detection.
- Localization performance depends strongly on the IoU threshold: AP falls from 0.9649 at IoU 0.1 to 0.1552 at IoU 0.7.
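To make the threshold sensitivity concrete, here is a minimal sketch of 3D IoU for axis-aligned boxes, checked against the two thresholds quoted above. The corner-style box format `(x1, y1, z1, x2, y2, z2)`, the function names, and the example boxes are assumptions for illustration; the paper's own evaluation code may differ.

```python
def box_volume(box):
    """Volume of an axis-aligned box given as (x1, y1, z1, x2, y2, z2)."""
    x1, y1, z1, x2, y2, z2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1) * max(0.0, z2 - z1)


def iou_3d(a, b):
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i], b[i])          # start of the overlap on this axis
        hi = min(a[i + 3], b[i + 3])  # end of the overlap on this axis
        inter *= max(0.0, hi - lo)    # zero if the boxes do not overlap
    union = box_volume(a) + box_volume(b) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical boxes in normalized coordinates: the prediction has the
# right size but is shifted by a quarter of the edge length on every axis.
gt = (0.20, 0.20, 0.20, 0.60, 0.60, 0.60)
pred = (0.30, 0.30, 0.30, 0.70, 0.70, 0.70)

iou = iou_3d(gt, pred)
hits = {t: iou >= t for t in (0.1, 0.7)}
print(round(iou, 4), hits)
```

In this toy case the shifted prediction clears IoU 0.1 but misses IoU 0.7, because the overlap shrinks multiplicatively across three axes. The same effect makes strict 3D thresholds much harder to meet than loose ones.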
Method
CT-3GDINO combines a Swin3D visual backbone, bidirectional feature enhancement, pseudo-text-guided query selection, and a cross-modality decoder to predict 3D organ boxes.
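The query selection step can be illustrated with a small sketch in the spirit of Grounding DINO's language-guided query selection: each visual token is scored by its best similarity to any class token, and the top-scoring tokens seed the decoder queries. This is an assumption about the general mechanism, not CT-3GDINO's actual implementation. The one-hot class tokens, the toy visual tokens, and the names `select_queries` and `dot` are hypothetical.

```python
CLASSES = ["liver", "spleen", "left kidney", "right kidney", "bowel"]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def select_queries(visual_tokens, class_tokens, k):
    """Return indices of the k visual tokens most similar to any class token."""
    # Score each visual token by its best match over all class tokens.
    scores = [max(dot(v, c) for c in class_tokens) for v in visual_tokens]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k], scores


# Placeholder frozen pseudo-text tokens: one one-hot vector per class.
# In a real model these would be fixed embeddings, not one-hot vectors.
class_tokens = [
    [1.0 if i == j else 0.0 for j in range(len(CLASSES))]
    for i in range(len(CLASSES))
]

# Toy flattened visual features (4 tokens, same dimension as the class tokens).
visual_tokens = [
    [0.9, 0.1, 0.0, 0.0, 0.0],  # strongly liver-like
    [0.1, 0.1, 0.1, 0.1, 0.1],  # background-like
    [0.0, 0.0, 0.0, 0.8, 0.2],  # right-kidney-like
    [0.2, 0.3, 0.1, 0.0, 0.0],  # weak match
]

top, scores = select_queries(visual_tokens, class_tokens, k=2)
print(top)
```

Because the class tokens stay frozen, no text encoder is needed at inference time; selection reduces to a similarity lookup against a small fixed table.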
In practice
- Use CT-3GDINO as an open-source baseline for 3D organ localization.
- Explore localization-aware pretraining to improve strict box alignment.
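When a predicted box is used as a spatial prior, a downstream step typically needs to map the normalized box back to voxel indices. The sketch below assumes a DETR-style center-size format `(cz, cy, cx, d, h, w)` with values in [0, 1]. The source does not specify the box format, so the layout, the helper name `box_to_voxel_slices`, and the `margin` parameter are all hypothetical.

```python
import math


def box_to_voxel_slices(box, shape, margin=0.0):
    """Convert a normalized center-size box to voxel slices.

    box:    (cz, cy, cx, d, h, w), all in [0, 1] (assumed format)
    shape:  volume shape as (D, H, W) in voxels
    margin: extra normalized padding per side, useful when boxes are coarse
    """
    slices = []
    for axis in range(3):
        center, size = box[axis], box[axis + 3]
        lo = (center - size / 2 - margin) * shape[axis]
        hi = (center + size / 2 + margin) * shape[axis]
        # Round outward and clamp to the volume bounds.
        start = max(0, math.floor(lo))
        stop = min(shape[axis], math.ceil(hi))
        slices.append(slice(start, stop))
    return tuple(slices)


# Hypothetical prediction on a 64 x 128 x 128 volume.
box = (0.5, 0.5, 0.25, 0.5, 0.5, 0.25)
crop = box_to_voxel_slices(box, shape=(64, 128, 128))
print(crop)
```

Given the limited strict alignment reported at IoU 0.7, a nonzero `margin` is one simple way to keep the organ inside the crop when the box is only coarsely placed.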
Topics
- 3D Object Detection
- Organ Localization
- Abdominal CT
- Grounding DINO
- Pseudo-Text Conditioning
- Medical Imaging
Best for: AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.