PersonaDrive: Human-Style Retrieval-Augmented VLA Agents for Closed-Loop Driving Simulation

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Software Development & Engineering · Depth: Expert, extended

Summary

PersonaDrive introduces a pipeline for vision-language-action (VLA) driving agents that produces human-style, style-diverse non-ego traffic in closed-loop simulation. Agents are conditioned on demonstrations retrieved from a style-instructed human driving dataset, collected from 8 participants driving CARLA leaderboard routes under aggressive, neutral, and conservative instructions. The three-stage pipeline consists of offline triplet mining, training a lightweight retrieval head, and fine-tuning a single VLA backbone to interpret retrieved context as in-context behavioral demonstrations. Style is switched at inference by querying a different per-style database, so no per-style retraining is needed. On Bench2Drive, PersonaDrive without style conditioning improves the driving score by 4.6% over SimLingo and 2.5% over HiP-AD. Under style conditioning, it attains the highest driving score in every style: its weakest style surpasses the strongest baseline, DMW, by 5.4%, and average speed and acceleration rise by 18% and 25%, respectively, from conservative to aggressive instructions.

Key takeaway

For autonomous driving developers who want more realistic closed-loop simulation with diverse, human-style traffic, PersonaDrive offers a retrieval-based alternative to training one model per style. You can generate agents that behave aggressively, neutrally, or conservatively from a single retrieval-augmented VLA model paired with per-style demonstration databases. Because no per-style retraining is needed, you can scale behavioral diversity by adding or swapping databases, which may in turn help ego policies generalize in complex interaction scenarios. Consider this method when you need more realistic and challenging simulation environments.

Key insights

PersonaDrive enables style-diverse VLA driving agents by conditioning them on retrieved human demonstrations, eliminating per-style retraining.

Principles

Human driving patterns yield more realistic agent behavior.
Decoupling style from driver identity enables class-level supervision.
Retrieval-augmented learning allows style modulation without model retraining.

Method

The pipeline has three stages: offline triplet mining using image-text similarity, training a lightweight retrieval head that fuses visual and control features, and fine-tuning a VLA backbone to interpret retrieved context as in-context demonstrations for waypoint prediction.
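The first two stages can be illustrated with a minimal, dependency-free sketch. The summary does not state the exact mining rule or loss, so everything here is an assumption for illustration: the hypothetical `mine_triplet` picks the most similar same-style frame as the positive and the most similar different-style frame as the negative, and `triplet_margin_loss` is the standard hinge-style triplet loss that a retrieval head could be trained with. The similarity matrix and embeddings are made-up toy values.

```python
import math

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss: zero once the positive is closer to the
    anchor than the negative by at least `margin`."""
    return max(0.0, l2(anchor, positive) - l2(anchor, negative) + margin)

def mine_triplet(anchor_idx, similarity, styles):
    """Hypothetical mining rule (not confirmed by the source):
    positive = most similar frame with the same style label,
    negative = most similar frame with a different style label."""
    n = len(styles)
    same = [j for j in range(n)
            if j != anchor_idx and styles[j] == styles[anchor_idx]]
    diff = [j for j in range(n) if styles[j] != styles[anchor_idx]]
    pos = max(same, key=lambda j: similarity[anchor_idx][j])
    neg = max(diff, key=lambda j: similarity[anchor_idx][j])
    return anchor_idx, pos, neg

# Toy data: four frames, two style classes, a precomputed
# (assumed) image-text similarity matrix, and 2-D embeddings.
styles = ["aggressive", "aggressive", "conservative", "conservative"]
similarity = [
    [1.0, 0.8, 0.6, 0.1],
    [0.8, 1.0, 0.3, 0.2],
    [0.6, 0.3, 1.0, 0.7],
    [0.1, 0.2, 0.7, 1.0],
]
embeddings = [[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [0.0, 2.0]]

a, p, n = mine_triplet(0, similarity, styles)
loss = triplet_margin_loss(embeddings[a], embeddings[p], embeddings[n])
print((a, p, n), loss)
```

Choosing the most similar different-style frame as the negative makes it a "hard" negative: the scene looks alike but the driving style differs, which is the distinction a style-aware retrieval head would need to learn.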

In practice

Employ per-style FAISS indices for efficient style switching.
Bundle control history and waypoints into each context point.
Subsample frames (e.g., with a stride of 5) for effective triplet mining.
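The three practices above can be sketched together in plain Python. This is an illustrative stand-in, not the authors' code: the hypothetical `StyleDatabase` class uses brute-force nearest-neighbor search where a real system would likely use one FAISS index per style (for example `faiss.IndexFlatL2`), and the keys, control values, and waypoints are invented toy data.

```python
import math

def subsample(frames, stride=5):
    """Keep every `stride`-th frame to reduce near-duplicate neighbors."""
    return frames[::stride]

class StyleDatabase:
    """Brute-force stand-in for a per-style vector index."""

    def __init__(self):
        self.keys = []      # retrieval embeddings
        self.entries = []   # bundled context points

    def add(self, key, control_history, waypoints):
        # Each context point bundles control history with waypoints.
        self.keys.append(key)
        self.entries.append(
            {"control_history": control_history, "waypoints": waypoints}
        )

    def search(self, query, k=1):
        # Rank stored keys by Euclidean distance to the query.
        ranked = sorted(range(len(self.keys)),
                        key=lambda i: math.dist(query, self.keys[i]))
        return [self.entries[i] for i in ranked[:k]]

# One database per style; switching style means switching database.
databases = {"aggressive": StyleDatabase(), "conservative": StyleDatabase()}
databases["aggressive"].add([0.0, 0.0], [(0.9, 0.0)], [(0.0, 0.0), (0.0, 4.0)])
databases["aggressive"].add([5.0, 5.0], [(0.8, 0.1)], [(0.0, 0.0), (1.0, 3.0)])
databases["conservative"].add([0.0, 0.1], [(0.3, 0.0)], [(0.0, 0.0), (0.0, 2.0)])
databases["conservative"].add([5.0, 5.0], [(0.2, 0.1)], [(0.0, 0.0), (0.5, 1.5)])

def retrieve(style, query, k=1):
    """Return the k nearest context points from the chosen style's database."""
    return databases[style].search(query, k)

query = [0.0, 0.0]  # same scene embedding, two different styles
print(subsample(list(range(12))))
print(retrieve("aggressive", query)[0]["waypoints"])
print(retrieve("conservative", query)[0]["waypoints"])
```

The same query returns different demonstrations depending only on which database is consulted, which is why the model itself would not need retraining to change style.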

Topics

PersonaDrive
VLA Agents
Retrieval-Augmented Learning
Driving Simulation
Human Driving Styles
CARLA Benchmarks

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.