When is 3D Worth It? A Resource-Performance Frontier for CNNs and Transformers in Lung CT

Source: cs.CV updates on arXiv.org · Field: Science & Research, Health & Medical Research, Artificial Intelligence & Machine Learning · Depth: Advanced, medium

Summary

The study examines how much input dimensionality (2D, 2.5D, or 3D) matters for convolutional neural networks (CNNs) and Vision Transformers (ViTs) in lung CT imaging, using lung cancer screening classification as the task. On a leakage-free NLST cohort (n = 1,977) with a fixed training protocol, the researchers mapped a resource–performance frontier and characterized failure modes. The 2.5D CNN showed the most favorable discrimination–stability trade-off, reaching an ROC-AUC of 0.682 (95% CI [0.546, 0.799]) with a stable operating point. 3D CNNs, by contrast, exhibited threshold instability, and ViTs frequently produced degenerate predictions (for example, classifying every case as positive) while requiring approximately 3x more GPU memory. The findings suggest that, for class-imbalanced lung cancer screening, 2D and 2.5D inputs balance performance, stability, and computational cost more reliably than full 3D representations, contrary to the common assumption that 3D is superior.
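A confidence interval such as the one quoted above is often obtained by bootstrapping the test set. The sketch below is one minimal way to do that in plain Python. It assumes a percentile bootstrap, which the summary does not confirm as the study's method, and every name and the toy data are hypothetical.

```python
import random

def roc_auc(labels, scores):
    """ROC-AUC as P(random positive outscores random negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for ROC-AUC (an assumed method, not the paper's)."""
    point = roc_auc(labels, scores)  # also validates that both classes exist
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample lost one class; common under heavy imbalance
        aucs.append(roc_auc(ys, [scores[i] for i in idx]))
    aucs.sort()
    lo = aucs[int((alpha / 2) * n_boot)]
    hi = aucs[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return point, lo, hi

# Hypothetical imbalanced toy data: 36 negatives, 4 positives.
_rng = random.Random(42)
labels = [0] * 36 + [1] * 4
scores = [_rng.gauss(0.0, 1.0) for _ in range(36)] + [_rng.gauss(1.0, 1.0) for _ in range(4)]

auc, ci_lo, ci_hi = bootstrap_auc_ci(labels, scores)
print(f"AUC={auc:.3f}, 95% CI [{ci_lo:.3f}, {ci_hi:.3f}]")
```

With only a handful of positive cases, the resampled AUC values spread widely, which is one reason intervals on imbalanced screening data tend to be broad.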

Key takeaway

If you are a machine learning engineer developing diagnostic models for class-imbalanced lung CT screening, consider 2.5D convolutional neural networks before full 3D or Vision Transformer architectures. Weigh any move to higher-dimensional inputs against the added computational cost and the risk of model instability or degenerate predictions. Evaluate operating behavior, such as threshold stability and sensitivity/specificity, alongside AUC to ensure clinical usability and avoid deploying unreliable systems.
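One way to check operating behavior alongside AUC is to pick a threshold on validation data and see how it behaves on held-out data. The sketch below assumes Youden's J as the selection rule, which the summary does not specify. The function names and the toy scores are hypothetical, chosen so that the held-out scores are shifted upward.

```python
def confusion_rates(labels, scores, threshold):
    """Return (sensitivity, specificity) when predicting positive for score >= threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return tp / (tp + fn), tn / (tn + fp)

def youden_threshold(labels, scores):
    """Threshold maximizing sensitivity + specificity - 1 (first maximum wins)."""
    best_t, best_j = None, float("-inf")
    for t in sorted(set(scores)):
        sens, spec = confusion_rates(labels, scores, t)
        if sens + spec - 1 > best_j:
            best_t, best_j = t, sens + spec - 1
    return best_t

def is_degenerate(scores, threshold):
    """True if every case receives the same predicted class at this threshold."""
    preds = [s >= threshold for s in scores]
    return all(preds) or not any(preds)

# Hypothetical validation and held-out scores; held-out scores are shifted upward.
val_y, val_s = [0, 0, 0, 0, 1, 1], [0.1, 0.2, 0.3, 0.4, 0.6, 0.7]
test_y, test_s = [0, 0, 0, 0, 1, 1], [0.65, 0.7, 0.8, 0.9, 0.85, 0.95]

t = youden_threshold(val_y, val_s)
sens, spec = confusion_rates(test_y, test_s, t)
print(f"threshold={t}, test sensitivity={sens}, test specificity={spec}, "
      f"degenerate={is_degenerate(test_s, t)}")
```

In this toy case the ranking on the held-out set is still partly informative, yet the transferred threshold labels every case positive. A report that includes only AUC would not reveal that.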

Key insights

Higher dimensionality in medical imaging models does not guarantee stable performance, especially with class imbalance.

Principles

Method

Compare 2D, 2.5D, and 3D inputs across matched CNN and ViT backbones under a fixed protocol, evaluating discrimination, operating behavior (threshold stability), and computational cost.
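To make the three input types concrete, the sketch below builds each from a single CT volume with NumPy. The summary does not say how the study constructs its 2.5D input; stacking adjacent axial slices as channels is a common convention and is assumed here. The function name, parameters, and array sizes are hypothetical.

```python
import numpy as np

def make_input(volume, mode, center=None, context=1):
    """Build a model input from a (D, H, W) volume.

    "2d"   -> one axial slice, shape (1, H, W)
    "2.5d" -> 2*context+1 adjacent slices as channels, shape (2*context+1, H, W)
    "3d"   -> whole volume with a channel axis, shape (1, D, H, W)
    """
    depth = volume.shape[0]
    k = depth // 2 if center is None else center
    if mode == "2d":
        return volume[k][None]
    if mode == "2.5d":
        # Clip indices so slices near the volume edge are replicated, not wrapped.
        idx = np.clip(np.arange(k - context, k + context + 1), 0, depth - 1)
        return volume[idx]
    if mode == "3d":
        return volume[None]
    raise ValueError(f"unknown mode: {mode!r}")

# Hypothetical small volume: 8 slices of 16 x 16 voxels.
vol = np.arange(8 * 16 * 16, dtype=np.float32).reshape(8, 16, 16)

for mode in ("2d", "2.5d", "3d"):
    x = make_input(vol, mode)
    print(mode, x.shape, "values per input:", x.size)
```

The values-per-input count grows with the number of slices used. That gives a rough sense of why memory cost rises with dimensionality, although actual GPU memory also depends on the backbone and batch size.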

Best for: Computer Vision Engineer, AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.