Disease-Centric Vision-Language Pretraining with Hybrid Visual Encoding for 3D Computed Tomography

2026-06-24 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Medical Imaging AI · Depth: Expert, medium

Summary

A new vision-language pre-training (VLP) framework addresses challenges in 3D Computed Tomography (CT) imaging by improving visual backbones and semantic alignment. This framework integrates a CNN-ViT hybrid encoder, which uses a 3D CNN backbone instead of ViT's patch embedding to efficiently capture local anatomical details while maintaining global attention and compatibility with existing cross-modal priors. It also incorporates a disease-level contrastive learning mechanism that employs learnable query tokens to dynamically extract disease-specific semantics from radiology reports, aligning them with corresponding visual features to differentiate distinct diseases within shared anatomical regions. Furthermore, a diagnosis-aware prompt strategy utilizes real clinical phrases and aggregated disease prototypes to enhance zero-shot diagnostic reliability and bridge the pre-training-inference gap. The model achieves state-of-the-art performance, with 84.4% AUC on CT-RATE (+5.1%), 75.4% AUC on Rad-ChestCT (+5.4%), and a notable +9.8% AUC gain on a challenging 60-disease benchmark, also demonstrating strong transferability to radiology report generation.

Key takeaway

AI scientists developing medical imaging models should consider this VLP framework's components to improve 3D CT diagnostic accuracy. A CNN-ViT hybrid encoder can improve the capture of local anatomical detail, while disease-level contrastive learning can refine semantic alignment. Diagnosis-aware prompts built from clinical phrases can improve zero-shot diagnostic reliability. The reported gains (+5.1% AUC on CT-RATE, +5.4% AUC on Rad-ChestCT) suggest this is a promising path to higher performance in complex medical AI tasks.

Key insights

The framework improves 3D CT diagnosis by integrating hybrid visual encoding, disease-level contrastive learning, and diagnosis-aware prompting.

Principles

Hybrid encoders can optimize local and global feature capture.
Disease-specific alignment improves semantic disentanglement.
Clinical phrasing bridges the gap between pre-training and inference.
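
The first principle can be illustrated with a minimal NumPy sketch of a hybrid encoder: a small strided 3D CNN stem stands in for the ViT patch embedding and produces a token grid, and a single self-attention layer then mixes those tokens globally. The layer count, channel widths, stride, and single-head attention are all assumptions made for illustration; the summary does not specify the paper's actual architecture.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)

def conv3d(x, w, stride=2):
    """3x3x3 convolution with padding 1. x: (C, D, H, W), w: (O, C, 3, 3, 3)."""
    x = np.pad(x, ((0, 0), (1, 1), (1, 1), (1, 1)))
    # Windows have shape (C, D, H, W, 3, 3, 3); slicing applies the stride.
    win = sliding_window_view(x, (3, 3, 3), axis=(1, 2, 3))
    win = win[:, ::stride, ::stride, ::stride]
    return np.einsum("cdhwijk,ocijk->odhw", win, w)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head global attention with a residual connection."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return tokens + attn @ v

def hybrid_encode(volume, conv_weights, attn_weights):
    """CNN stem (local detail) followed by attention over all tokens (global)."""
    x = volume
    for w in conv_weights:
        x = np.maximum(conv3d(x, w), 0.0)  # strided conv + ReLU
    tokens = x.reshape(x.shape[0], -1).T   # (num_tokens, channels)
    return self_attention(tokens, *attn_weights)

# Hypothetical sizes: a 1-channel 16x32x32 volume, three stride-2 stages.
channels = [1, 8, 16, 32]
conv_weights = [
    rng.standard_normal((o, c, 3, 3, 3)) / np.sqrt(c * 27)
    for c, o in zip(channels[:-1], channels[1:])
]
attn_weights = [rng.standard_normal((32, 32)) / np.sqrt(32) for _ in range(3)]

volume = rng.standard_normal((1, 16, 32, 32))
tokens = hybrid_encode(volume, conv_weights, attn_weights)
print(tokens.shape)
```

Because the stem emits an ordinary token sequence, the transformer layers that follow are unchanged, which is one plausible reading of how such a design stays compatible with pretrained cross-modal components.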

Method

The framework uses a CNN-ViT hybrid encoder, disease-level contrastive learning with query tokens, and a diagnosis-aware prompt strategy with clinical phrases and disease prototypes.
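
As a rough sketch of the disease-level contrastive step, the NumPy code below lets one query vector per disease pool report tokens through cross-attention, then aligns the pooled text features with per-disease visual features using a symmetric InfoNCE loss. Several choices here are assumptions rather than details from the source: the same queries also pool the visual tokens, attention is single-head, the temperature is 0.07, and diseases absent from the report are not handled. In a real model the queries and projections would be trained; here they are random.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def query_pool(queries, tokens, w_k, w_v):
    """Cross-attention: each disease query pools the tokens into one vector."""
    k, v = tokens @ w_k, tokens @ w_v
    scores = queries @ k.T / np.sqrt(queries.shape[-1])
    attn = np.exp(log_softmax(scores, axis=-1))  # (num_diseases, num_tokens)
    return attn @ v

def disease_contrastive_loss(text_feats, image_feats, temperature=0.07):
    """Symmetric InfoNCE; row i of both inputs belongs to disease i."""
    t, v = l2_normalize(text_feats), l2_normalize(image_feats)
    logits = t @ v.T / temperature
    idx = np.arange(logits.shape[0])
    text_to_image = log_softmax(logits, axis=1)[idx, idx]
    image_to_text = log_softmax(logits, axis=0)[idx, idx]
    return float(-0.5 * (text_to_image.mean() + image_to_text.mean()))

# Hypothetical sizes: 4 diseases, 16-dim features, 20 report and 32 visual tokens.
num_diseases, dim = 4, 16
queries = rng.standard_normal((num_diseases, dim))  # learnable in a real model
report_tokens = rng.standard_normal((20, dim))
visual_tokens = rng.standard_normal((32, dim))
w = [rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(4)]

text_feats = query_pool(queries, report_tokens, w[0], w[1])
image_feats = query_pool(queries, visual_tokens, w[2], w[3])
loss = disease_contrastive_loss(text_feats, image_feats)
print(loss)
```

The negatives in this loss are other diseases within the same scan and report pair, which is what pushes apart findings that share an anatomical region.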

In practice

Replace the ViT patch embedding with a 3D CNN backbone to capture local anatomical detail.
Use learnable query tokens for fine-grained disease alignment.
Incorporate clinical phrases for zero-shot diagnostic prompts.
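
The prompting step can be sketched as follows: several clinical-style phrases per disease are encoded, averaged into "present" and "absent" prototypes, and compared with an image embedding to get a zero-shot probability. Everything concrete here is an assumption for illustration: the phrases are invented rather than taken from the paper or any dataset, `toy_encode` is a bag-of-words stand-in for a trained text encoder, the image embedding is faked by encoding a report-like sentence, and the present-versus-absent softmax is a common CLIP-style scoring rule, not necessarily the paper's.

```python
import numpy as np

# Illustrative phrases only; not from the paper or from any dataset.
PROMPTS = {
    "pleural effusion": {
        "present": ["small bilateral pleural effusion",
                    "right pleural effusion with adjacent atelectasis"],
        "absent": ["no pleural effusion", "pleural spaces are clear"],
    },
    "cardiomegaly": {
        "present": ["heart size is enlarged", "mild cardiomegaly"],
        "absent": ["heart size is normal", "no cardiomegaly"],
    },
}

def build_vocab(prompts):
    words = sorted({w for groups in prompts.values()
                    for phrases in groups.values()
                    for p in phrases for w in p.lower().split()})
    return {w: i for i, w in enumerate(words)}

def toy_encode(text, vocab):
    """Hypothetical stand-in for a trained encoder: normalized bag of words."""
    vec = np.zeros(len(vocab))
    for w in text.lower().split():
        if w in vocab:  # out-of-vocabulary words are ignored
            vec[vocab[w]] += 1.0
    return vec / max(np.linalg.norm(vec), 1e-12)

def prototype(phrases, vocab):
    """Aggregate several phrase embeddings into one normalized prototype."""
    mean = np.mean([toy_encode(p, vocab) for p in phrases], axis=0)
    return mean / np.linalg.norm(mean)

def zero_shot_probs(image_emb, prompts, vocab, temperature=0.07):
    """P(present) per disease from a softmax over present/absent similarity."""
    probs = {}
    for disease, groups in prompts.items():
        sims = np.array([image_emb @ prototype(groups["present"], vocab),
                         image_emb @ prototype(groups["absent"], vocab)])
        z = sims / temperature
        e = np.exp(z - z.max())
        probs[disease] = float(e[0] / e.sum())
    return probs

vocab = build_vocab(PROMPTS)
# Stand-in for an aligned image embedding from the visual encoder.
image_emb = toy_encode(
    "small right pleural effusion with atelectasis heart size is normal", vocab)
probs = zero_shot_probs(image_emb, PROMPTS, vocab)
print(probs)
```

Averaging over several phrasings makes the score less dependent on any single wording, and using report-style language at inference keeps prompts closer to the text seen during pre-training.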

Topics

3D Computed Tomography
Vision-Language Pretraining
Hybrid Encoders
Contrastive Learning
Medical Diagnostics
Radiology Reports

Best for: Computer Vision Engineer, AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.