CORTEX: A Structured Reasoning Benchmark for Trustworthy 3D Chest CT MLLMs
Summary
CORTEX (Clinically Organized Reasoning and sTructured EXplanation) is a new structured reasoning benchmark designed for 3D chest CT multimodal large language models (MLLMs). It addresses the challenge of interpreting and verifying free-form MLLM reasoning in medical imaging, where current datasets lack the diagnostic traces and patient history crucial for trustworthy diagnoses. CORTEX restores this missing reasoning by providing 76,177 validated traces, each following a four-stage radiologist workflow: task understanding, visual observation, diagnostic reasoning, and answer synthesis. These traces are generated using frontier large language models, then rigorously filtered and verified through automated rubric scoring and expert radiologist review, with the structure and rubrics developed in collaboration with clinicians. Built upon the CT-RATE dataset, CORTEX supports open-ended VQA, closed-ended VQA, and report generation, offering both structured supervision and a stage-level evaluation protocol for developing reliable 3D chest CT MLLMs. The dataset and evaluation code will be publicly available upon acceptance.
Key takeaway
For machine learning engineers developing 3D chest CT MLLMs, CORTEX offers both training data and an evaluation protocol aimed at more trustworthy models. Its 76,177 validated, four-stage diagnostic traces can serve as structured supervision during training. Its stage-level evaluation protocol and clinician-designed rubrics can then be used to check a model's reasoning stage by stage, so that each diagnostic conclusion can be traced back to the observations and reasoning that support it. The intended benefit is better model interpretability and reliability.
Key insights
CORTEX provides structured, verifiable reasoning traces for 3D chest CT MLLMs, mirroring clinical diagnostic workflows.
Principles
- Diagnostic reasoning benefits from a structured, multi-stage workflow.
- Trustworthy MLLMs require traceable evidence linking findings to conclusions.
- Clinician collaboration is vital for medical AI benchmark design.
Method
Generate four-stage diagnostic traces (task understanding, visual observation, diagnostic reasoning, answer synthesis) using frontier LLMs, then filter and verify via automated rubrics and expert review.
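To make the method concrete, here is a minimal Python sketch of a four-stage trace and an automated filtering step. It is an illustration under stated assumptions, not the authors' pipeline: the `Trace` schema, the `filter_traces` helper, the `score_fn` callback, and the 0.8 threshold are all hypothetical, and the summary does not specify how CORTEX computes its rubric scores.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# The four stage names come from the summary; the field names are illustrative.
STAGES = (
    "task_understanding",
    "visual_observation",
    "diagnostic_reasoning",
    "answer_synthesis",
)


@dataclass
class Trace:
    """One four-stage diagnostic trace (hypothetical schema)."""

    task_understanding: str
    visual_observation: str
    diagnostic_reasoning: str
    answer_synthesis: str

    def is_complete(self) -> bool:
        # A trace is usable only if every stage contains non-blank text.
        return all(getattr(self, stage).strip() for stage in STAGES)


def filter_traces(
    traces: Iterable[Trace],
    score_fn: Callable[[Trace], float],
    min_score: float = 0.8,  # assumed threshold, not taken from the paper
) -> List[Trace]:
    """Keep complete traces whose rubric score meets the threshold."""
    return [t for t in traces if t.is_complete() and score_fn(t) >= min_score]
```

In this sketch, `score_fn` stands in for whatever automated rubric scorer is used. Traces that pass the filter would still go on to expert radiologist review, which is a human step and is not modeled here.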
In practice
- Use CORTEX for training MLLMs on structured 3D chest CT reasoning.
- Evaluate MLLM diagnostic traces using stage-level rubrics.
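Stage-level evaluation can be sketched as averaging per-stage rubric scores across a set of model outputs, so that a weakness in one stage (for example, visual observation) is not hidden behind an overall number. The function name `stage_level_report` and the 0-to-1 score scale are assumptions made for illustration. The actual CORTEX rubrics and aggregation are not described in this summary.

```python
from statistics import mean
from typing import Dict, List

# Stage names follow the four-stage workflow described in the summary.
STAGES = (
    "task_understanding",
    "visual_observation",
    "diagnostic_reasoning",
    "answer_synthesis",
)


def stage_level_report(per_trace_scores: List[Dict[str, float]]) -> Dict[str, float]:
    """Average rubric scores per stage over all evaluated traces.

    Each item maps a stage name to a score (assumed here to lie in [0, 1]).
    A missing stage raises KeyError rather than being silently skipped.
    """
    if not per_trace_scores:
        raise ValueError("need at least one scored trace")
    return {
        stage: mean(scores[stage] for scores in per_trace_scores)
        for stage in STAGES
    }
```

Reporting one number per stage keeps the evaluation aligned with the workflow: a model whose final answers look right but whose observation scores are low can be flagged for closer inspection.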
Topics
- CORTEX Benchmark
- 3D Chest CT
- Medical MLLMs
- Structured Reasoning
- Diagnostic Workflow
- Medical Imaging
Best for: Computer Vision Engineer, AI Scientist, Research Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.