Microsoft at NSDI 2026: Advances in large-scale networked systems

2026-05-05 · Source: Microsoft Research · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Software Development & Engineering · Depth: Expert, medium

Summary

Microsoft authors and collaborators contributed 11 accepted papers to the USENIX Symposium on Networked Systems Design and Implementation 2026 (NSDI '26), a key forum for advances in large-scale networked systems. The contributions span datacenter and wide-area networks, AI systems, and cloud infrastructure. Notable papers include "DroidSpeak," which enables KV cache sharing across fine-tuned LLM variants for up to 4x higher throughput; "Eywa," an LLM-based tool for automating model-based testing that found 33 bugs in network protocols; and "Octopus," a switch-free design for CXL memory pods that achieves 3.2x faster RPCs than in-rack RDMA. The remaining papers cover traffic engineering with probabilistic link capacities, video analytics with vision-language models, SmartNIC-enabled VM live migration, throughput-optimal collective communications, heuristic analysis from source code, harvesting spare CPU resources in containers, offloading cloud network services with SONiC DASH SmartSwitch, and fine-grained eBPF isolation.

Key takeaway

For MLOps engineers and AI infrastructure architects managing large-scale deployments, three of these NSDI '26 papers map most directly onto operational concerns. KV cache sharing techniques like DroidSpeak can raise LLM serving throughput. SmartNIC-enabled live migration with Pyrocumulus targets storage-optimized VMs, with the aim of improving operational efficiency and reducing downtime. Offloading cloud network services with SONiC DASH SmartSwitch could improve power and space efficiency in your data centers.

Key insights

Microsoft's 11 accepted NSDI '26 papers span datacenter and wide-area networks, AI systems, and cloud infrastructure.

Principles

Optimize resource sharing for LLM efficiency.
Automate testing with AI for bug detection.
Enhance memory disaggregation for cost and speed.

Method

DroidSpeak shares KV caches across LLMs; Eywa uses LLMs to build protocol models for testing; Octopus employs a switch-free design for CXL memory pods; KRAKENGUARD uses symbolic execution for eBPF isolation.
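
Model-based testing, as attributed to Eywa above, can be illustrated with a minimal sketch: a reference model of the protocol is run on generated inputs, and each implementation is checked against it. Everything here is hypothetical and for illustration only: the TTL rule, `reference_model`, `impl_a`, and `impl_b` are invented, and the model is hand-written rather than produced by an LLM as in Eywa.

```python
from itertools import product

def reference_model(ttl: int, hops: int) -> str:
    """Hypothetical protocol model: a packet is delivered only if TTL exceeds the hop count."""
    return "deliver" if ttl > hops else "drop"

def impl_a(ttl: int, hops: int) -> str:
    """Hypothetical implementation that matches the model."""
    return "deliver" if ttl > hops else "drop"

def impl_b(ttl: int, hops: int) -> str:
    """Hypothetical implementation with a deliberate off-by-one bug (>= instead of >)."""
    return "deliver" if ttl >= hops else "drop"

def find_disagreements(model, impls, inputs):
    """Run every input through the model and each implementation; collect mismatches."""
    failures = []
    for args in inputs:
        expected = model(*args)
        for name, impl in impls.items():
            actual = impl(*args)
            if actual != expected:
                failures.append((name, args, expected, actual))
    return failures

# Exhaustively enumerate a small input space: ttl and hops in 0..3.
inputs = list(product(range(4), repeat=2))
failures = find_disagreements(
    reference_model, {"impl_a": impl_a, "impl_b": impl_b}, inputs
)

for name, args, expected, actual in failures:
    print(f"{name}{args}: expected {expected}, got {actual}")
```

The sketch flags `impl_b` only on the boundary inputs where `ttl == hops`, which is the kind of edge case a model-derived test suite is meant to surface.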

In practice

Implement KV cache sharing for LLM serving.
Apply LLM-based model testing for network protocols.
Explore CXL memory pod designs for data centers.
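
As a rough illustration of the first item, the toy sketch below keys a prefix cache on the base model family instead of on each fine-tuned variant, so that variants of the same base can hit the same entry. It is not DroidSpeak's mechanism: the class `SharedPrefixCache`, the `VARIANT_TO_BASE` mapping, the model names, and `fake_prefill` are all hypothetical, and the sketch ignores the fact that fine-tuned variants compute different KV values, which any real system has to handle.

```python
import hashlib
from typing import Callable, Dict, Sequence, Tuple

class SharedPrefixCache:
    """Toy cache keyed by (base model, prompt-prefix hash) rather than by variant."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], object] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _prefix_key(tokens: Sequence[int]) -> str:
        # Hash the token IDs so the key stays small for long prompts.
        raw = ",".join(str(t) for t in tokens).encode()
        return hashlib.sha256(raw).hexdigest()

    def get_or_compute(
        self,
        base_model: str,
        tokens: Sequence[int],
        compute: Callable[[Sequence[int]], object],
    ) -> object:
        key = (base_model, self._prefix_key(tokens))
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = compute(tokens)
        return self._store[key]

# Hypothetical mapping from fine-tuned variants to their shared base model.
VARIANT_TO_BASE = {"support-bot-ft": "base-7b", "legal-ft": "base-7b"}

def fake_prefill(tokens: Sequence[int]) -> dict:
    """Stand-in for the expensive prefill that would produce a real KV cache."""
    return {"num_tokens": len(tokens)}

cache = SharedPrefixCache()
prompt = [101, 7, 42, 9]
for variant in ("support-bot-ft", "legal-ft"):
    cache.get_or_compute(VARIANT_TO_BASE[variant], prompt, fake_prefill)
cache.get_or_compute("base-7b", [101, 7], fake_prefill)  # different prefix

print(cache.hits, cache.misses)
```

The second variant reuses the entry created by the first because both map to the same base, while the shorter prefix is a separate entry. Keying on the variant name instead would turn that hit into a second prefill.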

Topics

Large-scale Networked Systems
Cloud Infrastructure
AI Systems Optimization
CXL Memory Pods
eBPF Isolation

Best for: MLOps Engineer, AI Engineer, Machine Learning Engineer, AI Scientist, Research Scientist, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Microsoft Research.