LiteVLA-H: Dual-Rate Vision-Language-Action Inference for Onboard Aerial Guidance and Semantic Perception
Summary
LiteVLA-H is a compact 256M-parameter Vision-Language-Action (VLA) system designed for dual-rate operation on an NVIDIA Jetson AGX Orin, addressing the challenge of low-latency closed-loop guidance for drones under strict onboard compute and communication constraints. The system features a fast outer-loop guidance mode, issuing reactive action tokens at 50.65 ms (19.74 Hz), and a slower semantic mode for scene understanding and narration, supporting sentence-level outputs at 149.90–164.57 ms (6.08–6.67 Hz). This dual-rate approach is motivated by the empirical observation that end-to-end latency in this compact edge regime is dominated by multimodal pre-fill rather than token decoding. LiteVLA-H employs a knowledge-preserving fine-tuning recipe that combines reactive flight data, aerial semantic data, and generic caption/VQA supervision to specialize the model without compromising its descriptive capabilities. The system is positioned against state-of-the-art architectures like AnywhereVLA, FutureVLA, and ReMem-VLA, demonstrating a higher edge inference rate for its action branch while maintaining periodic semantic awareness.
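The rates quoted above follow directly from the reported latencies, since rate in Hz is 1000 divided by latency in milliseconds. The short check below reproduces them; the function and dictionary names are illustrative and not taken from the paper.

```python
def rate_hz(latency_ms: float) -> float:
    """Convert a per-inference latency in milliseconds to a rate in Hz."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


# Latencies as reported in the summary (ms); the key names are illustrative.
reported_ms = {"action": 50.65, "semantic_fast": 149.90, "semantic_slow": 164.57}

rates = {name: round(rate_hz(ms), 2) for name, ms in reported_ms.items()}
print(rates)  # {'action': 19.74, 'semantic_fast': 6.67, 'semantic_slow': 6.08}
```

Note that the shorter semantic latency (149.90 ms) corresponds to the higher rate (6.67 Hz), and the longer one (164.57 ms) to the lower rate (6.08 Hz).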
Key takeaway
AI engineers developing VLA systems for aerial robotics should prioritize optimizing multimodal pre-fill latency over token decoding in compact edge deployments. A dual-rate scheduler can deliver high-frequency reactive guidance (about 20 Hz) alongside lower-frequency semantic awareness (about 6–7 Hz) on the same embedded platform, such as the Jetson AGX Orin, balancing responsiveness and interpretability.
Key insights
Edge VLA latency is dominated by pre-fill, which motivates dual-rate scheduling: fast reactive guidance alongside slower semantic awareness.
Principles
- Separate fast action from slow semantic processing.
- Pre-fill cost dominates short-output edge VLA latency.
- Mixed fine-tuning preserves general VLA competence.
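The pre-fill principle can be illustrated with a simple additive latency model: a fixed pre-fill cost plus a per-token decoding cost. This is a sketch under stated assumptions. The model itself and the numbers used (40 ms pre-fill, 5 ms per token) are hypothetical placeholders chosen for easy arithmetic, not measurements from the paper.

```python
def total_latency_ms(prefill_ms: float, per_token_ms: float, n_tokens: int) -> float:
    """Additive model: one pre-fill pass plus n_tokens decoding steps."""
    return prefill_ms + per_token_ms * n_tokens


def prefill_share(prefill_ms: float, per_token_ms: float, n_tokens: int) -> float:
    """Fraction of end-to-end latency spent in pre-fill."""
    return prefill_ms / total_latency_ms(prefill_ms, per_token_ms, n_tokens)


# Hypothetical placeholder costs, NOT values reported for LiteVLA-H.
PREFILL_MS = 40.0
PER_TOKEN_MS = 5.0

for n in (1, 2, 8):
    share = prefill_share(PREFILL_MS, PER_TOKEN_MS, n)
    print(f"{n} output tokens: pre-fill is {share:.0%} of total latency")
```

Under this model, the shorter the output, the larger the share of latency that trimming decoding cannot touch, which is why short action outputs benefit most from pre-fill optimization.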
Method
LiteVLA-H uses a dual-rate scheduler to emit action tokens at 19.74 Hz and semantic outputs at 6.08–6.67 Hz. Training uses a weighted mixture objective that combines action, aerial semantic, and generic caption/VQA losses with a knowledge-preserving regularizer.
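One way to read the weighted mixture objective is as a weighted sum of the three supervised losses plus a scaled regularization penalty. The sketch below assumes that form. The weight values, the class and function names, and the treatment of the regularizer as a single scalar penalty are all hypothetical; the summary does not give the actual weights or the regularizer's definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MixtureWeights:
    # Hypothetical weights; the paper's actual values are not given here.
    action: float = 1.0
    aerial_semantic: float = 0.5
    generic: float = 0.5
    regularizer: float = 0.25


def mixture_loss(
    action_loss: float,
    aerial_semantic_loss: float,
    generic_loss: float,
    reg_penalty: float,
    w: MixtureWeights = MixtureWeights(),
) -> float:
    """Weighted sum of per-source losses plus a scaled regularization penalty."""
    return (
        w.action * action_loss
        + w.aerial_semantic * aerial_semantic_loss
        + w.generic * generic_loss
        + w.regularizer * reg_penalty
    )


# Example with made-up per-batch loss values.
print(mixture_loss(2.0, 1.0, 1.0, 4.0))  # 4.0
```

Keeping the generic caption/VQA term in the sum is what lets the specialized model continue to receive gradient signal for its descriptive abilities during fine-tuning.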
In practice
- Deploy deadline-first scheduling for VLA actions.
- Prioritize pre-fill optimization for faster VLA reaction.
- Use mixed data for VLA fine-tuning to retain capabilities.
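As a rough illustration of deadline-first scheduling, the sketch below simulates non-preemptive earliest-deadline-first (EDF) dispatch of two periodic jobs on a single device. This is a generic EDF toy, not the scheduler described in the paper. The task periods and costs are hypothetical round numbers, and each job's deadline is assumed to equal its release time plus its period.

```python
from typing import Dict, List, Tuple


def edf_schedule(tasks: Dict[str, Tuple[int, int]], horizon_ms: int) -> List[Tuple[int, str]]:
    """Simulate non-preemptive EDF on one device.

    tasks maps a task name to (period_ms, cost_ms).
    Returns a list of (start_ms, task_name) entries in dispatch order.
    """
    next_release = {name: 0 for name in tasks}
    now = 0
    timeline: List[Tuple[int, str]] = []
    while now < horizon_ms:
        ready = [n for n, r in next_release.items() if r <= now]
        if not ready:
            # Device is idle: jump to the next job release.
            now = min(next_release.values())
            continue
        # Earliest deadline first; deadline = release + period (assumption).
        name = min(ready, key=lambda n: (next_release[n] + tasks[n][0], n))
        period, cost = tasks[name]
        timeline.append((now, name))
        now += cost  # job runs to completion (no preemption)
        next_release[name] += period
    return timeline


# Hypothetical workload: fast action job and slow semantic job.
demo_tasks = {"action": (50, 20), "semantic": (150, 60)}
print(edf_schedule(demo_tasks, horizon_ms=150))
# [(0, 'action'), (20, 'semantic'), (80, 'action'), (100, 'action')]
```

In this toy run the action job released at 50 ms does not start until 80 ms, because the semantic job cannot be interrupted once started. That kind of blocking is the reason to keep the slow branch's cost bounded when the fast branch has tight deadlines.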
Topics
- Vision-Language-Action Models
- Aerial Robotics
- Edge AI
- Dual-Rate Inference
- Jetson AGX Orin
Best for: AI Engineer, Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.