Drive-KD: Multi-Teacher Distillation for VLMs in Autonomous Driving

Source: cs.AI updates on arXiv.org · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

Drive-KD is a multi-teacher knowledge distillation framework for Vision-Language Models (VLMs) in autonomous driving, aimed at the high GPU memory demands and inference latency of large models. It decomposes driving into a "perception–reasoning–planning" triad and transfers each capability using layer-specific attention as the distillation signal. The per-capability single-teacher settings are then unified into one multi-teacher framework, which introduces asymmetric gradient projection (AGP) to mitigate cross-capability gradient conflicts. In the reported evaluations, the distilled InternVL3-1B model needs ~42× less GPU memory and reaches ~11.4× higher throughput, yet outperforms the pretrained 78B model on DriveBench and surpasses GPT-5.1 on planning. The authors also report that Drive-KD generalizes across model families and scales.
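To make the idea of a cross-capability gradient conflict concrete, here is a minimal sketch of a projection-based fix in the style of PCGrad. This summary does not specify the exact AGP rule, so the sketch assumes one reading of "asymmetric": one capability's gradient is treated as an anchor and left untouched, and only the other gradients are projected when they point against it. The function names, the choice of anchor, and the toy gradients are all hypothetical.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two flattened gradient vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_if_conflicting(
    grad: Sequence[float], anchor: Sequence[float], eps: float = 1e-12
) -> List[float]:
    """Remove the component of `grad` that opposes `anchor`.

    If the two gradients agree (non-negative inner product), `grad` is
    returned unchanged. Otherwise `grad` is projected onto the plane
    orthogonal to `anchor`, as in PCGrad.
    """
    d = dot(grad, anchor)
    if d >= 0.0:
        return list(grad)
    scale = d / (dot(anchor, anchor) + eps)
    return [g - scale * a for g, a in zip(grad, anchor)]


def combine_asymmetric(
    anchor: Sequence[float], others: Sequence[Sequence[float]]
) -> List[float]:
    """Sum the gradients, projecting only the non-anchor ones.

    ASSUMPTION: the anchor is never modified. This one-sided rule is an
    illustration, not the rule from the paper.
    """
    combined = list(anchor)
    for grad in others:
        projected = project_if_conflicting(grad, anchor)
        combined = [c + p for c, p in zip(combined, projected)]
    return combined


# Hypothetical toy gradients for two capabilities over two parameters.
planning_grad = [1.0, 0.0]     # treated as the anchor here (an assumption)
perception_grad = [-1.0, 1.0]  # opposes the anchor along the first axis
update = combine_asymmetric(planning_grad, [perception_grad])
```

In this toy case the perception gradient loses only the part that pushes against the planning gradient, so the combined update no longer cancels progress on the anchor capability.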

Key takeaway

For AI scientists and machine learning engineers building efficient VLMs for autonomous driving, this research suggests looking beyond conventional supervised fine-tuning and ever-larger models. Consider adopting Drive-KD's multi-teacher distillation framework, with its capability-specific attention signals and asymmetric gradient projection. According to the reported results, this lets a small model such as InternVL3-1B outperform a much larger pretrained model on DriveBench while using far less GPU memory. Rigorous closed-loop simulation and extensive real-world testing remain critical before any deployment on a physical system.

Key insights

Small VLMs can achieve superior autonomous driving performance and efficiency through multi-teacher knowledge distillation and targeted attention signals.

Principles

Method

Drive-KD systematically selects distillation layers and attention signals for perception (Layer 1 text-to-vision), reasoning (intermediate full attention with layer-group matching), and planning (penultimate layer text-to-vision), then unifies them with AGP.
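The text-to-vision attention signal named above can be sketched as follows. The sketch assumes a head-averaged attention matrix for one layer, stored as a nested list whose rows are query tokens and whose columns are key tokens, and it assumes teacher and student see the same token sequence. Renormalizing each row over the vision tokens and comparing the maps with a KL divergence are illustrative choices, not details confirmed by this summary. All function names are hypothetical.

```python
import math
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def text_to_vision(
    attn: Matrix,
    text_idx: Sequence[int],
    vision_idx: Sequence[int],
    eps: float = 1e-12,
) -> List[List[float]]:
    """Extract the text-query by vision-key block of an attention map.

    Each row is renormalized to sum to 1 over the vision tokens
    (an assumption made so rows can be compared as distributions).
    """
    block = []
    for q in text_idx:
        row = [attn[q][k] for k in vision_idx]
        total = sum(row) + eps
        block.append([v / total for v in row])
    return block


def attention_kl(
    teacher: Matrix, student: Matrix, eps: float = 1e-12
) -> float:
    """Mean over rows of KL(teacher || student)."""
    total = 0.0
    for t_row, s_row in zip(teacher, student):
        total += sum(
            t * math.log((t + eps) / (s + eps)) for t, s in zip(t_row, s_row)
        )
    return total / len(teacher)


# Toy example: tokens 0 and 1 are vision tokens, token 2 is a text token.
vision_idx, text_idx = [0, 1], [2]
teacher_attn = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.6, 0.2, 0.2]]
student_attn = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.3, 0.3, 0.4]]

t_block = text_to_vision(teacher_attn, text_idx, vision_idx)
s_block = text_to_vision(student_attn, text_idx, vision_idx)
loss = attention_kl(t_block, s_block)
```

Under these assumptions, the perception signal would apply this loss to the first layer and the planning signal to the penultimate layer. The reasoning signal, which uses full attention with layer-group matching, would need an additional mapping between student and teacher layers that is not shown here.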

Topics

Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.