Microsoft at NSDI 2026: Advances in large-scale networked systems
Summary
Microsoft authors and collaborators contributed 11 accepted papers to the USENIX Symposium on Networked Systems Design and Implementation 2026 (NSDI '26), a key forum for advances in large-scale networked systems. The contributions span datacenter and wide-area networks, AI systems, and cloud infrastructure. Notable papers include "DroidSpeak," which enables KV cache sharing across fine-tuned LLM variants for up to 4x higher throughput; "Eywa," an LLM-based tool for automating model-based testing that found 33 bugs in network protocols; and "Octopus," a switch-free design for CXL memory pods that achieves 3.2x faster RPCs than in-rack RDMA. The other papers cover traffic engineering with probabilistic link capacities, video analytics with vision-language models, SmartNIC-enabled VM live migration, throughput-optimal collective communications, heuristic analysis from source code, harvesting spare CPU resources in containers, offloading cloud network services with SONiC DASH SmartSwitch, and fine-grained eBPF isolation.
Key takeaway
For MLOps engineers and AI infrastructure architects managing large-scale deployments, several of these NSDI '26 papers point to practical options. Consider KV cache sharing techniques like DroidSpeak to raise LLM serving throughput, or SmartNIC-enabled live migration with Pyrocumulus to reduce downtime for storage-optimized VMs. Offloading cloud network services with SONiC DASH SmartSwitch is also worth evaluating, since it could improve power and space efficiency in your data centers.
Key insights
Microsoft research at NSDI '26 advances large-scale networked systems, AI, and cloud infrastructure.
Principles
- Optimize resource sharing for LLM efficiency.
- Automate testing with AI for bug detection.
- Enhance memory disaggregation for cost and speed.
Method
DroidSpeak shares KV caches across fine-tuned LLM variants; Eywa uses LLMs to build protocol models for testing; Octopus employs a switch-free design for CXL memory pods; KRAKENGUARD uses symbolic execution for eBPF isolation.
In practice
- Implement KV cache sharing for LLM serving.
- Apply LLM-based model testing for network protocols.
- Explore CXL memory pod designs for data centers.
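As a rough illustration of the model-based testing idea in the second item, the sketch below compares an implementation against a small reference model and reports the inputs on which they disagree. It is not Eywa's method: the model here is written by hand, whereas Eywa is described as using LLMs to build the protocol models. The wildcard rule, both functions, and the test inputs are hypothetical and simplified for illustration.

```python
from itertools import product


def model_matches(pattern: str, name: str) -> bool:
    """Reference model (assumed rule): a leading '*' label matches
    one or more labels; any other pattern must match exactly."""
    p, n = pattern.split("."), name.split(".")
    if p[0] != "*":
        return p == n
    suffix = p[1:]
    return len(n) > len(suffix) and n[len(n) - len(suffix):] == suffix


def impl_matches(pattern: str, name: str) -> bool:
    """Hypothetical implementation under test, with a deliberate bug:
    it also lets '*' match zero labels."""
    if pattern.startswith("*."):
        suffix = pattern[2:]
        return name == suffix or name.endswith("." + suffix)
    return pattern == name


def find_disagreements(patterns, names):
    """Return every (pattern, name) pair where model and implementation differ."""
    return [
        (p, n)
        for p, n in product(patterns, names)
        if model_matches(p, n) != impl_matches(p, n)
    ]


patterns = ["*.example.com", "www.example.com"]
names = ["example.com", "www.example.com", "a.b.example.com", "example.org"]
disagreements = find_disagreements(patterns, names)
print(disagreements)
```

Each disagreement is a candidate bug in either the implementation or the model, so it still needs manual triage before being reported.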
Topics
- Large-scale Networked Systems
- Cloud Infrastructure
- AI Systems Optimization
- CXL Memory Pods
- eBPF Isolation
Best for: MLOps Engineer, AI Engineer, Machine Learning Engineer, AI Scientist, Research Scientist, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Microsoft Research.