LiteVLA-H: Dual-Rate Vision-Language-Action Inference for Onboard Aerial Guidance and Semantic Perception

2026-05-05 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Emerging Technologies & Innovation · Depth: Expert, long

Summary

LiteVLA-H is a compact 256M-parameter Vision-Language-Action (VLA) system designed for dual-rate operation on an NVIDIA Jetson AGX Orin. It targets low-latency closed-loop guidance for drones under strict onboard compute and communication constraints. The system has a fast outer-loop guidance mode that issues reactive action tokens with a latency of 50.65 ms (19.74 Hz), and a slower semantic mode for scene understanding and narration that produces sentence-level outputs in 149.90–164.57 ms (6.08–6.67 Hz). The dual-rate design is motivated by the empirical observation that end-to-end latency in this compact edge regime is dominated by multimodal pre-fill rather than token decoding. LiteVLA-H uses a knowledge-preserving fine-tuning recipe that combines reactive flight data, aerial semantic data, and generic caption/VQA supervision to specialize the model without compromising its descriptive capabilities. The system is positioned against state-of-the-art architectures such as AnywhereVLA, FutureVLA, and ReMem-VLA, and reports a higher edge inference rate for its action branch while maintaining periodic semantic awareness.
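The rates quoted above follow directly from the per-output latencies, since rate in Hz is 1000 divided by latency in milliseconds. A minimal sketch of that arithmetic; the helper name `rate_hz` is ours for illustration and not from the paper:

```python
def rate_hz(latency_ms: float) -> float:
    """Convert a per-output latency in milliseconds to an output rate in Hz."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


# Latencies reported in the summary.
action_hz = rate_hz(50.65)          # fast guidance mode
semantic_fast_hz = rate_hz(149.90)  # quickest sentence-level output
semantic_slow_hz = rate_hz(164.57)  # slowest sentence-level output

print(f"action: {action_hz:.2f} Hz")
print(f"semantic: {semantic_slow_hz:.2f}-{semantic_fast_hz:.2f} Hz")
```

Rounded to two decimals, these reproduce the 19.74 Hz and 6.08–6.67 Hz figures.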

Key takeaway

AI engineers developing VLA systems for aerial robotics should prioritize optimizing multimodal pre-fill latency over token decoding in compact edge deployments. A dual-rate scheduler can deliver high-frequency reactive guidance (e.g., 20 Hz) alongside lower-frequency semantic awareness (e.g., 6–7 Hz) on the same embedded platform, such as the Jetson AGX Orin, balancing responsiveness and interpretability.

Key insights

Edge VLA latency is dominated by pre-fill, which motivates dual-rate scheduling: fast reactive guidance plus slower semantic awareness.

Principles

Separate fast action from slow semantic processing.
Pre-fill cost dominates short-output edge VLA latency.
Mixed fine-tuning preserves general VLA competence.
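One way to picture the pre-fill principle is a simple additive latency model: a fixed pre-fill cost plus a per-token decode cost. The sketch below assumes that model and uses made-up costs (`PREFILL_MS`, `PER_TOKEN_MS`); neither the model nor the numbers are taken from the paper's measurements.

```python
def latency_ms(prefill_ms: float, per_token_ms: float, n_tokens: int) -> float:
    """Additive model: one pre-fill pass plus n_tokens decode steps."""
    return prefill_ms + per_token_ms * n_tokens


def prefill_share(prefill_ms: float, per_token_ms: float, n_tokens: int) -> float:
    """Fraction of end-to-end latency spent in pre-fill."""
    return prefill_ms / latency_ms(prefill_ms, per_token_ms, n_tokens)


# Hypothetical costs, chosen only to illustrate the shape of the trade-off.
PREFILL_MS = 45.0
PER_TOKEN_MS = 5.0

short_output = prefill_share(PREFILL_MS, PER_TOKEN_MS, n_tokens=1)   # 45 / 50
long_output = prefill_share(PREFILL_MS, PER_TOKEN_MS, n_tokens=20)   # 45 / 145

print(f"pre-fill share, 1 token:   {short_output:.2f}")
print(f"pre-fill share, 20 tokens: {long_output:.2f}")
```

Under this model, shortening an already short output saves little, while any reduction in pre-fill cost carries over almost one-for-one to reaction time.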

Method

LiteVLA-H uses a dual-rate scheduler that issues action tokens at 19.74 Hz and semantic outputs at 6.08–6.67 Hz. Training uses a weighted mixture objective that combines action, aerial-semantic, and generic caption/VQA losses with a knowledge-preserving regularizer.
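A weighted mixture objective of this kind would plausibly take the following general form. The symbols are our own notation for illustration: the summary does not give the paper's exact formulation, the weight values, or the form of the regularizer.

```latex
\mathcal{L}(\theta)
  = \lambda_{\text{act}}\,\mathcal{L}_{\text{act}}(\theta)
  + \lambda_{\text{sem}}\,\mathcal{L}_{\text{sem}}(\theta)
  + \lambda_{\text{gen}}\,\mathcal{L}_{\text{gen}}(\theta)
  + \beta\,\mathcal{R}_{\text{kp}}(\theta)
```

Here $\mathcal{L}_{\text{act}}$, $\mathcal{L}_{\text{sem}}$, and $\mathcal{L}_{\text{gen}}$ stand for the action, aerial-semantic, and generic caption/VQA losses, $\mathcal{R}_{\text{kp}}$ for the knowledge-preserving regularizer, and $\lambda_{\cdot}$, $\beta$ for non-negative mixture weights.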

In practice

Deploy deadline-first scheduling for VLA actions.
Prioritize pre-fill optimization for faster VLA reaction.
Use mixed data for VLA fine-tuning to retain capabilities.

Topics

Vision-Language-Action Models
Aerial Robotics
Edge AI
Dual-Rate Inference
Jetson AGX Orin

Best for: AI Engineer, Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.