Execution-State Capsules: Graph-Bound Execution-State Checkpoint and Restore for Low-Latency, Small-Batch, On-Device Physical-AI Serving

· Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

Execution-state capsules are a graph-bound checkpoint and restore mechanism for low-latency, small-batch, on-device physical-AI serving. Mainstream LLM serving systems primarily reuse prefix work through paged or radix key-value (KV) caches to maximize throughput; capsules instead manage the complete restorable state at a committed boundary: KV, recurrent state, convolution state, MTP state, and metadata. This lets a full execution boundary be snapshotted, restored, forked, or rolled back. FlashRT, a white-box kernel runtime with an NVIDIA CUDA backend, implements this using contiguous static buffers. On an RTX 5090, GPU-resident snapshot and restore operations are sub-millisecond, and the Time-To-First-Token (TTFT) speedup over cold prefill grows from 3.9x at 2k tokens to 27x at 16k tokens. A KV-only ablation diverged, which shows that the recurrent state is load-bearing. Correctness and the structural properties also hold on Jetson AGX Thor and DGX Spark, positioning capsules as a complementary, latency-first approach to explicit execution-state reuse rather than a replacement for high-throughput KV-cache serving.
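To make the KV-only ablation concrete, here is a minimal toy sketch in Python. It is not FlashRT and not the model from the article: the `step` function, the modulus, and the token values are all hypothetical, chosen only so that each output depends on both a KV-like list and a recurrent value. Restoring the KV part alone then gives a continuation that differs from the reference, while restoring the full state reproduces it.

```python
import copy

def step(state, token):
    """One decode step of a toy hybrid model (hypothetical).

    The output mixes an attention-like read over the KV list with a
    recurrent value, so both are needed to continue correctly.
    """
    state["kv"].append(token)
    state["h"] = (state["h"] * 31 + token) % 1009
    return (sum(state["kv"]) + state["h"]) % 1009

def run(state, tokens):
    return [step(state, t) for t in tokens]

prefix, suffix = [3, 1, 4, 1, 5], [9, 2, 6]

# Process the prefix once, then checkpoint at that boundary.
live = {"kv": [], "h": 0}
run(live, prefix)
full_capsule = copy.deepcopy(live)                  # KV + recurrent state
kv_only_capsule = {"kv": list(live["kv"]), "h": 0}  # recurrent state dropped

reference = run(live, suffix)                       # uninterrupted continuation
from_full = run(copy.deepcopy(full_capsule), suffix)
from_kv_only = run(copy.deepcopy(kv_only_capsule), suffix)

print("full restore matches reference:   ", from_full == reference)
print("KV-only restore matches reference:", from_kv_only == reference)
```

In this toy, the full restore matches and the KV-only restore does not, which is the sense in which the recurrent state is load-bearing.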

Key takeaway

AI engineers building interactive LLM agents or robot policies for on-device deployment should consider execution-state capsules for sub-millisecond context switching and large Time-To-First-Token speedups. Because a capsule restores KV, recurrent, and convolution state together, a system can branch, reset, or re-enter an execution state quickly, which matters under tight responsiveness budgets. Where latency is paramount, evaluate FlashRT-like mechanisms rather than relying on KV-only reuse.

Key insights

Execution-state capsules checkpoint and restore the full execution state for low-latency, on-device AI, complementing rather than replacing KV-cache reuse.

Method

FlashRT, a white-box kernel runtime, captures graph plans over contiguous static buffers to snapshot and restore complete execution boundaries, including KV and recurrent state.
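The idea of snapshotting over contiguous static buffers can be sketched on the host in a few lines of Python. This is an illustration under stated assumptions, not FlashRT's API: `StaticArena`, its region names, the byte sizes, and the layout signature used to bind a capsule to one buffer layout are all hypothetical. The point is only that when every piece of restorable state lives in one preallocated buffer, snapshot and restore reduce to a single same-size copy, and a fork is a restore into a second arena with the same layout.

```python
import hashlib

class StaticArena:
    """One contiguous, preallocated buffer holding all restorable state (hypothetical)."""

    def __init__(self, region_sizes):
        self.layout = {}
        offset = 0
        for name, size in region_sizes.items():
            self.layout[name] = (offset, size)
            offset += size
        self.buf = bytearray(offset)
        # Assumption: a capsule is only valid for the layout it was taken from,
        # so we tag it with a hash of that layout.
        self.signature = hashlib.sha256(
            repr(sorted(self.layout.items())).encode()
        ).hexdigest()

    def write(self, name, data):
        offset, size = self.layout[name]
        if len(data) != size:
            raise ValueError(f"region {name!r} expects {size} bytes")
        self.buf[offset:offset + size] = data  # in place, buffer never resizes

    def read(self, name):
        offset, size = self.layout[name]
        return bytes(self.buf[offset:offset + size])

    def snapshot(self):
        # One contiguous copy of the whole arena at a committed boundary.
        return (self.signature, bytes(self.buf))

    def restore(self, capsule):
        signature, payload = capsule
        if signature != self.signature:
            raise ValueError("capsule was taken from a different layout")
        self.buf[:] = payload  # same-size overwrite keeps the buffer in place

sizes = {"kv": 8, "recurrent": 4, "conv": 4, "mtp": 2, "meta": 2}
arena = StaticArena(sizes)
arena.write("kv", b"\x01" * 8)
arena.write("recurrent", b"\x02" * 4)
capsule = arena.snapshot()              # checkpoint

arena.write("recurrent", b"\x09" * 4)   # execution moves on
arena.restore(capsule)                  # rollback to the checkpoint

fork = StaticArena(sizes)               # fork: second arena, same layout
fork.restore(capsule)
fork.write("kv", b"\x07" * 8)           # the fork diverges independently
```

On a GPU, the analogous operation would presumably be a device-side copy over buffers whose addresses stay fixed, which is what would let a captured graph plan be replayed after a restore; the sketch above does not model that part.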

Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.