How NVIDIA Extreme Hardware-Software Co-Design Delivered a Large Inference Boost for Sarvam AI’s Sovereign Models

Source: NVIDIA Technical Blog · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, long

Summary

Sarvam AI, a generative AI startup in Bengaluru, India, collaborated with NVIDIA to optimize its Sovereign 30B large language model for production deployment, targeting high throughput and low latency for multilingual, multimodal applications. The joint effort achieved a 4x inference speedup on NVIDIA Blackwell relative to the baseline on NVIDIA H100 GPUs. The gain came in two stages: kernel and scheduling optimizations on NVIDIA H100 SXM GPUs yielded a 2x speedup, and Blackwell's compute capabilities combined with NVFP4 weight quantization added a further 2x, which compounds to 4x overall. Sarvam AI's models, which come in 3B, 30B, and 100B versions, use a mixture-of-experts (MoE) architecture, support 22 Indian languages, English, math, and code, and were trained using NVIDIA Nemotron libraries and the NeMo Framework.
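
To make the NVFP4 weight quantization step more concrete, here is a minimal sketch of block-scaled 4-bit float quantization. It is an illustration only, not the code used by Sarvam AI or NVIDIA. It assumes 16-element blocks and the E2M1 value grid that NVIDIA has publicly described for NVFP4, and it simplifies the format by keeping each block's scale as an ordinary Python float instead of an FP8 value. All function names are hypothetical.

```python
# Toy block-scaled 4-bit float quantization in the style of NVFP4.
# Assumptions (not from the article): 16-element blocks, the E2M1 value grid,
# and a plain Python float as the per-block scale (a simplification).

E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # representable magnitudes
BLOCK_SIZE = 16


def quantize_block(block):
    """Return (scale, codes); each code is a signed value from E2M1_GRID."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 1.0, [0.0] * len(block)
    scale = max_abs / E2M1_GRID[-1]  # map the largest magnitude onto 6.0
    codes = []
    for x in block:
        target = abs(x) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        codes.append(nearest if x >= 0 else -nearest)
    return scale, codes


def quantize_weights(weights):
    """Split a flat list of floats into blocks and quantize each one."""
    return [
        quantize_block(weights[start:start + BLOCK_SIZE])
        for start in range(0, len(weights), BLOCK_SIZE)
    ]


def dequantize_weights(blocks):
    """Rebuild approximate weights from (scale, codes) pairs."""
    restored = []
    for scale, codes in blocks:
        restored.extend(scale * c for c in codes)
    return restored
```

Because the widest gap in the grid is between 4 and 6, the reconstruction error of any weight in this sketch is at most one sixth of the largest magnitude in its block. Small blocks keep that bound tight even when a tensor contains a few outliers.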

Key takeaway

For AI engineers deploying large language models under strict latency and cost requirements, consider a holistic optimization approach. Co-designing model architecture with hardware, kernel, and scheduling strategies can yield substantial inference speedups, as Sarvam AI's 4x performance gain demonstrates. Use profiling tools such as NVIDIA Nsight Systems to identify bottlenecks, then explore kernel fusion, mixed batching, and disaggregated serving to maximize throughput while meeting P95 latency SLAs.
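
As a small illustration of what checking a P95 SLA involves, the sketch below computes a nearest-rank percentile over per-request latencies and compares it with a target. The latency values and the 150 ms target are made-up numbers for illustration, and the function names are hypothetical; nothing here comes from the original article's measurements.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def meets_sla(latencies_ms, sla_ms, pct=95):
    """True if the pct-th percentile latency is within the SLA target."""
    return percentile_nearest_rank(latencies_ms, pct) <= sla_ms


# Hypothetical per-request latencies in milliseconds (one slow outlier).
latencies_ms = [120, 135, 128, 142, 119, 131, 450, 125, 138, 129,
                133, 127, 140, 122, 136, 130, 124, 139, 126, 134]

p95 = percentile_nearest_rank(latencies_ms, 95)
print(f"P95 = {p95} ms, meets 150 ms SLA: {meets_sla(latencies_ms, 150)}")
```

A percentile target tolerates the occasional outlier (the single 450 ms request above does not affect P95 with 20 samples) while still bounding what most users experience, which is why tail percentiles are a common way to state serving SLAs.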

Key insights

Full-stack co-optimization of model design, kernels, scheduling, and hardware significantly boosts LLM inference performance.

Method

Optimize MoE LLM inference on NVIDIA H100 SXM by fusing kernel operations, implementing mixed prefill/decode scheduling, and using disaggregated serving (1P+1D, read here as one prefill instance plus one decode instance). Then quantize weights to NVFP4 for Blackwell GPUs.
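
The idea behind mixed prefill/decode scheduling can be sketched with a toy planner. This is a simplified illustration under stated assumptions, not the scheduler used in the work described: it assumes a fixed per-step token budget, gives each in-flight decode request one token first, and spends the remaining budget on chunks of waiting prompts. The `Request` class and `plan_step` function are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(eq=False)  # identity-based equality so list.remove targets this object
class Request:
    name: str
    prompt_tokens: int   # prompt tokens still to prefill
    output_tokens: int   # output tokens still to decode


def plan_step(decoding, waiting, token_budget):
    """Build one mixed batch: decode tokens first, then prefill chunks.

    Returns a list of (request_name, phase, tokens) tuples and updates the
    request counters, the decoding list, and the waiting deque in place.
    """
    batch = []
    budget = token_budget

    # 1) Each in-flight decode request gets one token, budget permitting.
    for req in list(decoding):
        if budget == 0:
            break
        batch.append((req.name, "decode", 1))
        req.output_tokens -= 1
        budget -= 1
        if req.output_tokens == 0:
            decoding.remove(req)

    # 2) Spend the remainder on prefill, chunking prompts that do not fit.
    while budget > 0 and waiting:
        req = waiting[0]
        chunk = min(req.prompt_tokens, budget)
        batch.append((req.name, "prefill", chunk))
        req.prompt_tokens -= chunk
        budget -= chunk
        if req.prompt_tokens == 0:
            waiting.popleft()
            if req.output_tokens > 0:
                decoding.append(req)  # starts decoding on the next step

    return batch
```

Serving decode tokens first keeps token-by-token latency steady for requests already streaming, while the leftover budget lets new prompts make progress without a separate prefill-only step. Disaggregated serving takes the opposite route and places the two phases on separate workers.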

Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.