Autonomous Skeletal Landmark Localization towards Agentic C-Arm Control

2026-04-22 · Source: cs.CV updates on arXiv.org · Field: Health & Wellbeing — Artificial Intelligence & Machine Learning, Medical Devices & Health Technology, Robotics & Autonomous Systems · Depth: Advanced, extended

Summary

Researchers from the University of Vermont and Cleveland Clinic have developed an agentic C-arm control framework that uses fine-tuned multimodal large language models (MLLMs) for autonomous skeletal landmark localization. The approach targets two limitations of conventional deep learning (DL) methods: they lack reasoning capabilities and cannot incorporate clinician feedback. The study fine-tuned Gemma-3 (4B, 12B, and 27B parameters) and Qwen-2.5VL (7B and 32B parameters) on synthetic X-ray datasets (51,200 training and 10,240 testing image-answer pairs) and on a real X-ray dataset (1,564 training and 174 testing images). In quantitative evaluations, the fine-tuned MLLMs were competitive with a leading DL approach at landmark localization; the fine-tuned Gemma-3 27B model reached Hit@1 of 0.74 and Hit@2 of 0.85 on synthetic data. In qualitative experiments, the MLLMs corrected initial predictions through reasoning and navigated the C-arm step by step towards a target location, which suggests potential for intelligent assistance in surgical interventions.
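The summary reports Hit@1 and Hit@2 without defining them. As a minimal sketch, assuming Hit@k means the fraction of test images whose true closest landmark appears among the model's top k ranked predictions (this definition, the function name `hit_at_k`, and all landmark names below are illustrative assumptions, not taken from the paper), the metric could be computed like this:

```python
from typing import Sequence


def hit_at_k(predicted: Sequence[Sequence[str]], truth: Sequence[str], k: int) -> float:
    """Fraction of examples whose true label is among the first k predictions.

    ASSUMPTION: this is one common reading of Hit@k; the paper may define it
    differently.
    """
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    if not truth:
        raise ValueError("need at least one example")
    hits = sum(1 for ranked, target in zip(predicted, truth) if target in ranked[:k])
    return hits / len(truth)


# Hypothetical ranked predictions and ground truth, for illustration only.
preds = [
    ["L4 vertebra", "L5 vertebra", "sacrum"],
    ["left femoral head", "pelvis", "sacrum"],
    ["T12 vertebra", "L1 vertebra", "L2 vertebra"],
    ["skull base", "C1 vertebra", "C2 vertebra"],
]
truth = ["L4 vertebra", "pelvis", "L2 vertebra", "skull base"]

hit1 = hit_at_k(preds, truth, k=1)
hit2 = hit_at_k(preds, truth, k=2)
print(f"Hit@1 = {hit1:.2f}, Hit@2 = {hit2:.2f}")
```

Under this reading, Hit@k can never decrease as k grows, which is consistent with the reported Hit@2 being higher than Hit@1.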

Key takeaway

For AI scientists developing medical imaging guidance systems, this research indicates that fine-tuned MLLMs offer a path beyond fixed image-to-motion mapping. Consider integrating MLLMs where a system needs to reason, incorporate clinician feedback, and navigate the C-arm over multiple steps, especially in critical procedures such as stroke thrombectomy. Current DL models may offer higher initial precision, but MLLMs add the agentic capabilities needed for robust, adaptive control.

Key insights

Fine-tuned MLLMs can perform autonomous skeletal landmark localization and C-arm navigation with reasoning capabilities.

Principles

Anatomical spatial grounding improves MLLM understanding.
Fine-tuning MLLMs preserves general language abilities.
Hybrid DL-MLLM approaches can enhance precision and reasoning.

Method

MLLMs are fine-tuned with supervised learning and QLoRA on X-ray images, each paired with the ordered list of its closest skeletal landmarks. This training enables spatial reasoning, and chain-of-thought prompting then drives C-arm navigation.
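The pairing of an image with its ordered closest landmarks can be sketched as follows. This is a hedged illustration, not the authors' pipeline: it assumes 2D landmark coordinates are available per image and that "closest" is measured by Euclidean distance from the view center. The function names, the question text, the file path, and the coordinates are all hypothetical.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def ordered_closest_landmarks(
    view_center: Point, landmarks: Dict[str, Point], top_n: int = 3
) -> List[str]:
    """Return landmark names sorted by distance to the view center.

    ASSUMPTION: Euclidean distance in image coordinates; ties are broken by name.
    """
    def distance(name: str) -> float:
        x, y = landmarks[name]
        return math.hypot(x - view_center[0], y - view_center[1])

    return sorted(landmarks, key=lambda name: (distance(name), name))[:top_n]


def make_training_pair(
    image_path: str, view_center: Point, landmarks: Dict[str, Point], top_n: int = 3
) -> Dict[str, str]:
    """Build one image-answer pair (hypothetical schema) for supervised fine-tuning."""
    names = ordered_closest_landmarks(view_center, landmarks, top_n)
    return {
        "image": image_path,
        "question": "Which skeletal landmarks are closest to the image center, nearest first?",
        "answer": ", ".join(names),
    }


# Hypothetical landmark coordinates in millimetres, for illustration only.
example_landmarks = {
    "sacrum": (0.0, 0.0),
    "L5 vertebra": (0.0, 40.0),
    "L4 vertebra": (0.0, 80.0),
    "left femoral head": (90.0, -60.0),
}
pair = make_training_pair("drr_000001.png", (0.0, 50.0), example_landmarks)
print(pair["answer"])
```

Storing the answer as an ordered text list keeps the target in a form a language model can generate directly, at the cost of discarding the exact distances.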

In practice

Use synthetic digitally reconstructed radiographs (DRRs) for robust AI model training.
Apply QLoRA for efficient MLLM fine-tuning on medical images.
Integrate MLLMs for clinician feedback and iterative refinement.
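To make the efficiency argument for QLoRA concrete, the arithmetic below estimates the memory needed just to hold the base model weights at 16-bit versus 4-bit precision for the model sizes named in the summary. It is a rough sketch that assumes the 4-bit base weights standard in QLoRA and ignores quantization constants, adapter weights, activations, and optimizer state, so real requirements are higher. The function name and the use of 1 GB = 10^9 bytes are choices made for this illustration.

```python
# Parameter counts (in billions) for the models named in the summary.
MODEL_SIZES_B = {
    "Gemma-3 4B": 4,
    "Gemma-3 12B": 12,
    "Gemma-3 27B": 27,
    "Qwen-2.5VL 7B": 7,
    "Qwen-2.5VL 32B": 32,
}


def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes).

    ASSUMPTION: ignores quantization constants, LoRA adapters, activations,
    and optimizer state, so this is a lower bound.
    """
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


for name, size in MODEL_SIZES_B.items():
    full = weight_memory_gb(size, 16)
    quantized = weight_memory_gb(size, 4)
    print(f"{name}: 16-bit ~{full:.1f} GB, 4-bit ~{quantized:.1f} GB")
```

By this estimate the weights of the 27B model shrink from about 54 GB at 16-bit to about 13.5 GB at 4-bit, which is the main reason quantized fine-tuning is attractive for the larger models.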

Topics

C-arm Control
Skeletal Landmark Localization
Multimodal Large Language Models
Agentic AI
X-ray Imaging

Best for: AI Scientist, Research Scientist, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.