Autonomous Skeletal Landmark Localization towards Agentic C-Arm Control

Source: cs.CV updates on arXiv.org · Field: Health & Wellbeing (Artificial Intelligence & Machine Learning, Medical Devices & Health Technology, Robotics & Autonomous Systems) · Depth: Advanced, extended

Summary

Researchers from the University of Vermont and Cleveland Clinic have developed an agentic C-arm control framework using fine-tuned multimodal large language models (MLLMs) for autonomous skeletal landmark localization. This approach addresses limitations of conventional deep learning (DL) methods, which lack reasoning capabilities and cannot incorporate clinician feedback. The study fine-tuned Gemma-3 (4B, 12B, 27B parameters) and Qwen-2.5VL (7B, 32B parameters) MLLMs on both synthetic X-ray datasets (51,200 training, 10,240 testing image-answer pairs) and a real X-ray dataset (1,564 training, 174 testing images). Quantitative evaluations showed fine-tuned MLLMs achieved competitive performance against a leading DL approach in landmark localization, with the fine-tuned Gemma-3 27B model reaching Hit@2: 0.85 and Hit@1: 0.74 on synthetic data. Qualitative experiments demonstrated the MLLMs' ability to correct initial predictions through reasoning and sequentially navigate the C-arm towards a target location, indicating potential for intelligent assistance in surgical interventions.
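The Hit@1 and Hit@2 figures above can be read as top-k accuracy. The summary does not define the metric, so the sketch below assumes that Hit@k is the fraction of test images whose ground-truth closest landmark appears among the first k landmarks the model outputs. The function name `hit_at_k` and the landmark labels are hypothetical and used only for illustration.

```python
from typing import Sequence


def hit_at_k(predictions: Sequence[Sequence[str]],
             targets: Sequence[str],
             k: int) -> float:
    """Fraction of samples whose target is among the first k predictions.

    Assumption: each prediction is an ordered list of landmark names,
    most likely first, and each target is a single landmark name.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if k < 1:
        raise ValueError("k must be at least 1")
    if not targets:
        return 0.0
    hits = sum(1 for pred, tgt in zip(predictions, targets) if tgt in pred[:k])
    return hits / len(targets)


if __name__ == "__main__":
    # Hypothetical landmark names, not taken from the paper.
    preds = [
        ["L4", "L5", "T12"],    # target ranked first
        ["femur_head", "sacrum"],  # target ranked second
        ["C2", "C1"],           # target missing
    ]
    truth = ["L4", "sacrum", "C7"]
    print(round(hit_at_k(preds, truth, 1), 3))  # 0.333
    print(round(hit_at_k(preds, truth, 2), 3))  # 0.667
```

Under this reading, Hit@2 is never lower than Hit@1, which matches the reported 0.85 and 0.74.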

Key takeaway

For AI scientists developing medical imaging guidance systems, this research indicates that fine-tuned MLLMs offer a path beyond fixed image-to-motion mapping. Consider integrating MLLMs to enable reasoning, incorporate clinician feedback, and achieve multi-step C-arm navigation, especially in critical procedures such as stroke thrombectomy. Current DL models may offer higher initial precision, but MLLMs add the agentic capabilities needed for robust, adaptive control.

Key insights

Fine-tuned MLLMs can perform autonomous skeletal landmark localization and C-arm navigation with reasoning capabilities.

Principles

Method

MLLMs are fine-tuned using supervised learning with QLoRA on X-ray images paired with ordered closest skeletal landmarks, enabling spatial reasoning and C-arm navigation via chain-of-thought prompting.
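One way to picture the "ordered closest skeletal landmarks" target is as a ranking of annotated landmarks by distance from a reference point. The sketch below is an illustration under stated assumptions, not the authors' pipeline: it assumes landmarks are 2D image-plane coordinates, that the reference point is the image center, and that the answer string is a comma-separated list. The names `ordered_closest_landmarks` and `format_answer`, and the landmark labels, are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def ordered_closest_landmarks(landmarks: Dict[str, Point],
                              center: Point,
                              top_n: int = 3) -> List[str]:
    """Return up to top_n landmark names sorted by Euclidean distance to center.

    Assumption: 'closest' means closest to the image center in the 2D
    image plane; the summary does not state the reference point.
    """
    ranked = sorted(landmarks, key=lambda name: math.dist(landmarks[name], center))
    return ranked[:top_n]


def format_answer(names: List[str]) -> str:
    """Join ranked names into a text answer for an image-answer pair."""
    return ", ".join(names)


if __name__ == "__main__":
    # Hypothetical landmark coordinates relative to the image center.
    example = {
        "L3": (10.0, 40.0),
        "L4": (12.0, 5.0),
        "L5": (11.0, -30.0),
        "sacrum": (9.0, -70.0),
    }
    ranking = ordered_closest_landmarks(example, center=(0.0, 0.0))
    print(format_answer(ranking))  # L4, L5, L3
```

Pairing each X-ray with such an ordered answer gives the model a text target that encodes relative spatial position, which is the kind of signal a supervised fine-tuning run could learn from.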

Best for: AI Scientist, Research Scientist, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.