CORTEX: A Structured Reasoning Benchmark for Trustworthy 3D Chest CT MLLMs

· Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision & Pattern Recognition, Health & Medical Research · Depth: Expert, quick

Summary

CORTEX (Clinically Organized Reasoning and sTructured EXplanation) is a structured reasoning benchmark for 3D chest CT multimodal large language models (MLLMs). It targets a specific problem: free-form MLLM reasoning in medical imaging is hard to interpret and verify, and current datasets lack the diagnostic traces and patient history needed for trustworthy diagnoses. CORTEX restores this missing reasoning with 76,177 validated traces, each following a four-stage radiologist workflow: task understanding, visual observation, diagnostic reasoning, and answer synthesis. The traces are generated with frontier large language models, then filtered and verified through automated rubric scoring and expert radiologist review; the trace structure and the rubrics were developed in collaboration with clinicians. Built on the CT-RATE dataset, CORTEX supports open-ended VQA, closed-ended VQA, and report generation, and provides both structured supervision and a stage-level evaluation protocol for developing reliable 3D chest CT MLLMs. The authors state that the dataset and evaluation code will be made publicly available upon acceptance.

Key takeaway

For machine learning engineers developing 3D chest CT MLLMs, CORTEX offers 76,177 validated, four-stage diagnostic traces that can serve as structured supervision during training. Its stage-level evaluation protocol, together with rubrics developed in collaboration with clinicians, lets you check a model's reasoning one stage at a time, so that a diagnostic conclusion can be traced back to the observations and reasoning that support it. The practical benefit is reasoning that is easier to inspect and verify than free-form model output.
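To make the idea of stage-level evaluation concrete, here is a minimal Python sketch that averages rubric scores per reasoning stage across a set of evaluated samples. The summary does not describe how CORTEX aggregates scores, so the function name `stage_level_report`, the 0-to-1 score scale, and the use of a plain mean are all assumptions made for illustration, not the benchmark's actual protocol.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable


def stage_level_report(per_sample_scores: Iterable[Dict[str, float]]) -> Dict[str, float]:
    """Average rubric scores per stage over all samples.

    Hypothetical helper: each item maps a stage name to a score in [0, 1].
    The aggregation (a simple mean) is an assumption, not CORTEX's method.
    """
    by_stage = defaultdict(list)
    for scores in per_sample_scores:
        for stage, value in scores.items():
            by_stage[stage].append(value)
    # One mean per stage, so a weak stage is visible on its own.
    return {stage: mean(values) for stage, values in by_stage.items()}


# Illustrative scores only; these are not results from the paper.
example = [
    {"visual_observation": 1.0, "diagnostic_reasoning": 0.5},
    {"visual_observation": 0.5, "diagnostic_reasoning": 0.5},
]
report = stage_level_report(example)
```

Reporting one number per stage, rather than a single overall score, shows whether a model fails at observing findings or at reasoning from them.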

Key insights

CORTEX provides structured, verifiable reasoning traces for 3D chest CT MLLMs, mirroring clinical diagnostic workflows.

Method

Generate four-stage diagnostic traces (task understanding, visual observation, diagnostic reasoning, answer synthesis) using frontier LLMs, then filter and verify via automated rubrics and expert review.
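The method above can be sketched as a small data structure plus a filtering step. This is a hedged illustration only: the `Trace` class, the stage keys, the `score_fn` callback, and the 0.8 threshold are hypothetical names and values chosen for the example, and the summary does not specify CORTEX's real schema, rubric, or cut-off. Expert radiologist review is represented by a flag that a human reviewer would set later, since it cannot be automated.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# The four stages named in the summary; the key spellings are assumptions.
STAGES = (
    "task_understanding",
    "visual_observation",
    "diagnostic_reasoning",
    "answer_synthesis",
)


@dataclass
class Trace:
    """Hypothetical container for one four-stage reasoning trace."""
    question: str
    stages: Dict[str, str]
    rubric_scores: Dict[str, float] = field(default_factory=dict)
    expert_approved: bool = False  # set by a human reviewer, not by code


def is_complete(trace: Trace) -> bool:
    """A trace is complete when every stage has non-empty text."""
    return all(trace.stages.get(stage, "").strip() for stage in STAGES)


def filter_traces(
    traces: List[Trace],
    score_fn: Callable[[str, str], float],
    threshold: float = 0.8,  # assumed cut-off, not taken from the paper
) -> List[Trace]:
    """Keep complete traces whose weakest stage meets the threshold.

    score_fn(stage, text) stands in for an automated rubric scorer.
    """
    kept = []
    for trace in traces:
        if not is_complete(trace):
            continue
        trace.rubric_scores = {s: score_fn(s, trace.stages[s]) for s in STAGES}
        if min(trace.rubric_scores.values()) >= threshold:
            kept.append(trace)
    return kept
```

Gating on the minimum stage score, rather than the average, is one possible design choice: it prevents a strong final answer from masking a weak observation or reasoning stage. Traces that pass would then go on to expert review.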

Best for: Computer Vision Engineer, AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.