How NVIDIA Extreme Hardware-Software Co-Design Delivered a Large Inference Boost for Sarvam AI’s Sovereign Models
Summary
Sarvam AI, a generative AI startup in Bengaluru, India, collaborated with NVIDIA to optimize its Sovereign 30B large language model for production deployment, targeting high throughput and low latency for multilingual, multimodal applications. The joint effort achieved a 4x speedup in inference performance on NVIDIA Blackwell over the baseline on NVIDIA H100 GPUs. Kernel and scheduling optimizations on NVIDIA H100 SXM GPUs yielded a 2x speedup, and Blackwell's compute capabilities combined with NVFP4 weight quantization contributed a further 2x gain. Sarvam AI's models, which come in 3B, 30B, and 100B versions, use a mixture-of-experts (MoE) architecture and support 22 Indian languages, English, math, and code. They were trained using NVIDIA Nemotron libraries and the NeMo Framework.
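To make the NVFP4 weight quantization mentioned above more concrete, here is a minimal sketch of block-scaled 4-bit quantization. It assumes the commonly described NVFP4 layout: E2M1 values shared across 16-element blocks with one scale per block. It is a simplification and not Sarvam AI's or NVIDIA's implementation: the scale is kept as a plain Python float rather than an FP8 value, the additional per-tensor scale is omitted, and the names `quantize_block` and `dequantize_block` are hypothetical.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # assumed block size: one shared scale per 16 weights


def quantize_block(block):
    """Return (scale, quantized values) for one block of weights."""
    if len(block) != BLOCK:
        raise ValueError(f"expected {BLOCK} values, got {len(block)}")
    amax = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the largest E2M1 value (6.0).
    scale = amax / 6.0 if amax > 0 else 1.0
    quantized = []
    for x in block:
        # Round the scaled magnitude to the nearest representable value.
        mag = min(E2M1_VALUES, key=lambda v: abs(abs(x) / scale - v))
        quantized.append(math.copysign(mag, x))
    return scale, quantized


def dequantize_block(scale, quantized):
    """Recover approximate weights from the scale and 4-bit values."""
    return [scale * v for v in quantized]


# Hypothetical weights, padded with zeros to fill one block.
weights = [6.0, -5.2, 2.4, 0.4] + [0.0] * 12
scale, q = quantize_block(weights)
restored = dequantize_block(scale, q)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(scale, q[:4], round(max_error, 3))
```

Each weight is stored in 4 bits plus a small shared scale, which cuts weight memory traffic relative to 8-bit or 16-bit formats at the cost of the rounding error printed at the end.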
Key takeaway
For AI engineers deploying large language models under strict latency and cost requirements, optimize the whole stack rather than one layer at a time. Co-designing the model architecture with hardware, kernel, and scheduling strategies can deliver substantial inference speedups, as Sarvam AI's 4x performance gain demonstrates. Start by profiling with tools like NVIDIA Nsight Systems to identify bottlenecks, then explore kernel fusion, mixed batching, and disaggregated serving to maximize throughput while meeting P95 latency SLAs.
Key insights
Full-stack co-optimization of model design, kernels, scheduling, and hardware significantly boosts LLM inference performance.
Principles
- MoE architectures can scale intelligence efficiently.
- P95 latency metrics are critical for real-world user experience.
- Disaggregated serving, which separates prefill from decode, can outperform aggregated serving.
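To illustrate why P95 matters more than the average, the following sketch computes the mean, median, and P95 of a set of request latencies using the nearest-rank method. The latency values are made up for illustration and the `percentile` helper is hypothetical, not taken from the original article.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical per-request latencies in milliseconds:
# most requests are fast, two are slow.
latencies_ms = [100] * 18 + [900, 1200]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
print(mean_ms, p50_ms, p95_ms)
```

In this made-up sample the median looks healthy while the P95 exposes the slow tail that a noticeable share of users would experience, which is why SLAs are often written against P95 rather than the mean.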
Method
Optimize MoE LLM inference by fusing kernel operations, implementing mixed prefill/decode scheduling, and utilizing disaggregated serving (1P+1D) on NVIDIA H100 SXM, then quantize to NVFP4 for Blackwell GPUs.
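The mixed prefill/decode scheduling step can be sketched as a per-step token budget shared between decoding requests and waiting prompts. This is a toy illustration under stated assumptions, not the scheduler used by Sarvam AI or by any particular serving framework: the `Request` type, the `build_mixed_batch` function, the decode-first policy, and the choice to split prompts into chunks are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    prompt_tokens_left: int  # tokens still to prefill; 0 means the request is decoding


def build_mixed_batch(requests, token_budget):
    """Plan one forward pass as {req_id: tokens_processed_this_step}.

    Decoding requests each get one token first so ongoing generations are
    not stalled; the remaining budget is filled with prefill chunks.
    """
    batch = {}
    budget = token_budget

    # Pass 1: one decode token per in-flight generation.
    for r in requests:
        if r.prompt_tokens_left == 0 and budget > 0:
            batch[r.req_id] = 1
            budget -= 1

    # Pass 2: spend what is left on prompts, chunking any that do not fit.
    for r in requests:
        if r.prompt_tokens_left > 0 and budget > 0:
            chunk = min(r.prompt_tokens_left, budget)
            batch[r.req_id] = chunk
            budget -= chunk

    return batch


requests = [
    Request("a", 0),    # decoding
    Request("b", 0),    # decoding
    Request("c", 300),  # waiting prompt
    Request("d", 500),  # waiting prompt
]
plan = build_mixed_batch(requests, token_budget=256)
print(plan)
```

Mixing the two phases in one batch lets compute-heavy prefill work fill the GPU capacity that memory-bound decode steps would otherwise leave idle. Disaggregated serving takes the opposite approach and places the two phases on separate workers.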
In practice
- Use NVIDIA Nsight Systems to profile kernel bottlenecks.
- Implement Fused TopK and fused QK norm + RoPE kernels.
- Experiment with mixed batching for prefill and decode.
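As a reference for what a fused TopK kernel in an MoE router has to compute, the sketch below performs the routing steps one at a time in plain Python: softmax over the expert logits, selection of the top-k experts, and renormalization of the kept weights. The exact router formulation used in Sarvam AI's models is not given here, so the softmax scoring and the renormalization are assumptions, and `route_token` is a hypothetical name.

```python
import math


def route_token(router_logits, k):
    """Unfused reference routing for one token.

    Returns a list of (expert_index, weight) pairs for the k highest-scoring
    experts, with weights renormalized to sum to 1.
    """
    # Step 1: numerically stable softmax over all experts.
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Step 2: pick the k experts with the highest probability.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

    # Step 3: renormalize so the selected weights sum to 1.
    kept = sum(probs[i] for i in top)
    return [(i, probs[i] / kept) for i in top]


# Hypothetical router logits for a layer with four experts.
logits = [0.1, 2.0, -1.0, 1.5]
selected = route_token(logits, k=2)
print(selected)
```

On a GPU, running each step as its own kernel writes intermediate results to memory and pays a launch cost every time. A fused kernel performs the same computation in a single launch, and this per-kernel overhead is what a profiler such as NVIDIA Nsight Systems makes visible.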
Topics
- LLM Inference Optimization
- NVIDIA Blackwell Architecture
- Mixture-of-Experts
- Sovereign AI Models
- GPU Acceleration
Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.