Drive-KD: Multi-Teacher Distillation for VLMs in Autonomous Driving

2026-06-06 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

Drive-KD is a multi-teacher knowledge distillation framework for Vision-Language Models (VLMs) in autonomous driving. It targets the high GPU memory demands and inference latency of large models. The framework decomposes driving into a "perception–reasoning–planning" triad and transfers each capability using layer-specific attention as the distillation signal. It then combines the single-teacher settings into one multi-teacher framework and introduces asymmetric gradient projection (AGP) to manage gradient conflicts between capabilities. In the reported evaluations, the distilled InternVL3-1B model uses ~42× less GPU memory and reaches ~11.4× higher throughput, yet outperforms the pretrained 78B model on DriveBench and surpasses GPT-5.1 in planning. The authors also report that Drive-KD generalizes across model families and scales.

Key takeaway

For AI scientists and machine learning engineers building efficient VLMs for autonomous driving, this research suggests that conventional supervised fine-tuning and ever-larger models are not the only route to better performance. Consider adopting Drive-KD's multi-teacher distillation framework, with its capability-specific attention signals and asymmetric gradient projection. The reported results show a model as small as InternVL3-1B outperforming a 78B model on DriveBench at far lower cost, but rigorous closed-loop simulation and extensive real-world testing remain critical before any physical deployment.

Key insights

With multi-teacher knowledge distillation and targeted attention signals, small VLMs can outperform much larger models on autonomous driving benchmarks while running far more efficiently.

Principles

Autonomous driving capabilities (perception, reasoning, planning) benefit from sequential decomposition.
Layer-specific attention is a more stable distillation signal than hidden states or output distributions.
Asymmetric gradient projection (AGP) effectively mitigates conflicts in multi-objective distillation.
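
To make the attention-as-signal principle concrete, here is a minimal sketch of an attention-matching loss. It assumes head-averaged attention maps stored as nested lists, identical token layouts for teacher and student, and a KL divergence objective. The summary does not say which divergence or normalization Drive-KD uses, so those choices and every name below (text_to_vision_block, attention_kl) are illustrative, not the authors' implementation.

```python
import math

def text_to_vision_block(attn, text_idx, vision_idx):
    """Keep text-query rows and vision-key columns, then renormalize each row.

    The renormalization is an assumption made so each row is a distribution.
    """
    block = []
    for t in text_idx:
        row = [attn[t][v] for v in vision_idx]
        total = sum(row)
        if total > 0:
            block.append([x / total for x in row])
        else:
            # A text token with no mass on vision tokens falls back to uniform.
            block.append([1.0 / len(row)] * len(row))
    return block

def attention_kl(teacher, student, eps=1e-8):
    """Mean over rows of KL(teacher_row || student_row)."""
    per_row = [
        sum(t * math.log((t + eps) / (s + eps)) for t, s in zip(t_row, s_row))
        for t_row, s_row in zip(teacher, student)
    ]
    return sum(per_row) / len(per_row)

# Toy 4-token sequence: tokens 0-1 are vision, tokens 2-3 are text.
VISION, TEXT = [0, 1], [2, 3]
teacher_attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.6, 0.2, 0.2, 0.0],
    [0.1, 0.5, 0.2, 0.2],
]
student_attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.3, 0.3, 0.4, 0.0],
    [0.3, 0.3, 0.2, 0.2],
]

teacher_block = text_to_vision_block(teacher_attn, TEXT, VISION)
student_block = text_to_vision_block(student_attn, TEXT, VISION)
loss = attention_kl(teacher_block, student_block)
print(f"text-to-vision attention KL: {loss:.4f}")
```

In a real training loop the same computation would run on framework tensors so that gradients flow into the student; the list-based version only shows the shape of the signal.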

Method

Drive-KD systematically selects a distillation layer and attention signal for each capability: text-to-vision attention at layer 1 for perception, full attention at intermediate layers with layer-group matching for reasoning, and text-to-vision attention at the penultimate layer for planning. It then unifies the three objectives with AGP.
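
The summary does not spell out how layer-group matching pairs a deep teacher with a shallower student. One plausible reading, sketched below purely as an assumption, is to split the teacher's intermediate layers into contiguous groups, one per student layer, and use the mean attention map of each group as that student layer's target. The helper names (contiguous_groups, group_mean_attention) are hypothetical.

```python
def contiguous_groups(num_teacher_layers, num_student_layers):
    """Split teacher layer indices into contiguous, near-equal groups,
    one group per student layer."""
    if not 0 < num_student_layers <= num_teacher_layers:
        raise ValueError("need 0 < num_student_layers <= num_teacher_layers")
    bounds = [
        i * num_teacher_layers // num_student_layers
        for i in range(num_student_layers + 1)
    ]
    return [list(range(bounds[i], bounds[i + 1])) for i in range(num_student_layers)]

def group_mean_attention(teacher_maps, group):
    """Element-wise mean of the teacher attention maps for the layers in group."""
    n = len(group)
    first = teacher_maps[group[0]]
    return [
        [sum(teacher_maps[layer][r][c] for layer in group) / n
         for c in range(len(first[0]))]
        for r in range(len(first))
    ]

# Toy example: 6 teacher layers mapped onto 3 student layers.
groups = contiguous_groups(6, 3)

# Two 2x2 row-stochastic teacher maps averaged into a single target.
toy_maps = [
    [[1.0, 0.0], [0.5, 0.5]],
    [[0.5, 0.5], [0.0, 1.0]],
]
target = group_mean_attention(toy_maps, [0, 1])
print(groups, target)
```

Averaging row-stochastic maps keeps each row a valid distribution, so the result can be fed straight into an attention-matching loss.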

In practice

Distill Layer 1 cross-modal attention for perception tasks.
Use penultimate-layer cross-modal attention for planning tasks.
Employ asymmetric gradient projection to manage multi-task learning conflicts.
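
The exact AGP rule is not given in this summary. The sketch below shows one way a gradient projection can be made asymmetric, assuming a PCGrad-style step in which one "protected" gradient is left untouched and every other gradient that conflicts with it (negative dot product) has its conflicting component removed. Which capability is protected, and the function names, are assumptions for illustration only.

```python
def dot(a, b):
    """Dot product of two equal-length gradient vectors."""
    return sum(x * y for x, y in zip(a, b))

def project_if_conflicting(g, anchor):
    """Remove from g its component along anchor, but only when they conflict."""
    d = dot(g, anchor)
    norm_sq = dot(anchor, anchor)
    if d >= 0 or norm_sq == 0:
        return list(g)  # no conflict: leave the gradient unchanged
    return [x - (d / norm_sq) * a for x, a in zip(g, anchor)]

def asymmetric_combine(protected, others):
    """Sum gradients; protected is never modified, others are projected against it."""
    combined = list(protected)
    for g in others:
        adjusted = project_if_conflicting(g, protected)
        combined = [c + a for c, a in zip(combined, adjusted)]
    return combined

# Toy example: the second gradient pushes against the protected one.
protected_grad = [1.0, 0.0]
other_grad = [-1.0, 1.0]

adjusted = project_if_conflicting(other_grad, protected_grad)
combined = asymmetric_combine(protected_grad, [other_grad])
print(adjusted, combined)
```

The asymmetry is the design choice to note: a symmetric scheme would also project the protected gradient against the others, whereas here only the non-protected gradients give way.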

Topics

Autonomous Driving
Vision-Language Models
Knowledge Distillation
Multi-Teacher Distillation
Asymmetric Gradient Projection
Model Efficiency

Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.