How NVIDIA Extreme Hardware-Software Co-Design Delivered a Large Inference Boost for Sarvam AI’s Sovereign Models

2026-02-18 · Source: NVIDIA Technical Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, long

Summary

Sarvam AI, a generative AI startup in Bengaluru, India, collaborated with NVIDIA to optimize its Sovereign 30B large language model for production deployment, targeting high throughput and low latency for multilingual, multimodal applications. The joint effort achieved a 4x speedup in inference performance on NVIDIA Blackwell over baseline NVIDIA H100 GPUs. The gain compounds two stages: kernel and scheduling optimizations on NVIDIA H100 SXM GPUs delivered a 2x speedup, and Blackwell's compute capabilities combined with NVFP4 weight quantization delivered a further 2x. Sarvam AI's models, available in 3B, 30B, and 100B versions, use a mixture-of-experts (MoE) architecture, support 22 Indian languages plus English, math, and code, and were trained using NVIDIA Nemotron libraries and the NeMo Framework.

Key takeaway

For AI Engineers deploying large language models with strict latency and cost requirements, consider a holistic optimization approach. By co-designing model architecture with hardware, kernel, and scheduling strategies, you can achieve substantial inference speedups, as demonstrated by Sarvam AI's 4x performance gain. Prioritize profiling tools like NVIDIA Nsight Systems to identify bottlenecks and explore techniques like kernel fusion, mixed batching, and disaggregated serving to maximize throughput and meet critical P95 SLAs.

Key insights

Full-stack co-optimization of model design, kernels, scheduling, and hardware significantly boosts LLM inference performance.

Principles

MoE architectures can scale intelligence efficiently.
P95 latency metrics are critical for real-world user experience.
Disaggregated serving, which runs prefill and decode on separate workers, can outperform aggregated serving.
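
To illustrate why P95 latency matters more than the average, here is a minimal sketch that computes the mean, P50, and P95 of a latency sample using the nearest-rank method. The latency values are hypothetical and chosen for illustration only; they are not measurements from the article, and `percentile_nearest_rank` is a helper written for this example.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # 1-based rank of the smallest value that covers pct percent of the samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical per-request latencies in milliseconds (illustrative only).
latencies_ms = [120, 125, 130, 128, 122, 127, 124, 126, 900, 123,
                121, 129, 131, 119, 118, 132, 117, 133, 116, 950]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile_nearest_rank(latencies_ms, 50)
p95_ms = percentile_nearest_rank(latencies_ms, 95)

print(f"mean={mean_ms:.2f} ms  p50={p50_ms} ms  p95={p95_ms} ms")
```

In this sample two slow requests out of twenty barely move the median but dominate the P95, which is the kind of tail behavior a P95 SLA is meant to catch.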

Method

Optimize MoE LLM inference by fusing kernel operations, implementing mixed prefill/decode scheduling, and using disaggregated serving (1P+1D: one prefill worker plus one decode worker) on NVIDIA H100 SXM, then quantize weights to NVFP4 for Blackwell GPUs.
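
As a rough illustration of mixed prefill/decode scheduling, the sketch below builds one batch per step under a fixed token budget: every request that is already decoding gets one token, and whatever budget remains is filled with chunks of pending prompt tokens. This is a generic toy scheduler, not the scheduler used by Sarvam AI or NVIDIA; the `Request` class, `build_mixed_batch`, `apply_batch`, and the token budget are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    prompt_tokens: int  # prompt tokens still waiting to be prefilled
    decode_tokens: int  # output tokens still to be generated


def build_mixed_batch(requests, token_budget):
    """Choose work for one step: decode tokens first, then prefill chunks."""
    batch = []
    remaining = token_budget
    # Decoding requests each need exactly one token per step.
    for req in requests:
        if remaining == 0:
            break
        if req.prompt_tokens == 0 and req.decode_tokens > 0:
            batch.append((req.name, "decode", 1))
            remaining -= 1
    # Fill the leftover budget with chunks of pending prompts.
    for req in requests:
        if remaining == 0:
            break
        if req.prompt_tokens > 0:
            chunk = min(req.prompt_tokens, remaining)
            batch.append((req.name, "prefill", chunk))
            remaining -= chunk
    return batch


def apply_batch(requests, batch):
    """Update request state as if the batch had been executed."""
    by_name = {req.name: req for req in requests}
    for name, phase, tokens in batch:
        if phase == "prefill":
            by_name[name].prompt_tokens -= tokens
        else:
            by_name[name].decode_tokens -= tokens


# Two requests are mid-generation; a third arrives with a 10-token prompt.
requests = [Request("a", 0, 3), Request("b", 0, 2), Request("c", 10, 4)]
steps = []
for _ in range(3):
    batch = build_mixed_batch(requests, token_budget=8)
    apply_batch(requests, batch)
    steps.append(batch)
    print(batch)
```

The design choice shown here is that decode work is never starved by a long prompt: the new prompt is split across steps instead of stalling the requests that are already generating. A production scheduler would also account for KV-cache memory and for the first output token produced at the end of prefill, both of which this sketch ignores.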

In practice

Use NVIDIA Nsight Systems to profile kernel bottlenecks.
Implement fused TopK and fused QK norm + RoPE kernels.
Experiment with mixed batching for prefill and decode.
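
For context on what a fused TopK kernel has to compute in an MoE router, here is an unfused reference in plain Python: softmax over the router logits, selection of the top-k experts, and renormalization of their weights. This is a sketch under assumptions. The order of softmax and top-k, and whether weights are renormalized, vary between models, and the article summary does not specify Sarvam AI's choice. `route_top_k` and the logits are hypothetical.

```python
import math


def route_top_k(router_logits, k):
    """Return (expert_ids, weights) for one token's router logits."""
    if not 0 < k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Step 1: numerically stable softmax over all experts.
    max_logit = max(router_logits)
    exps = [math.exp(x - max_logit) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Step 2: pick the k most probable experts (ties go to the lower index).
    expert_ids = sorted(range(len(probs)), key=lambda i: (-probs[i], i))[:k]
    # Step 3: renormalize so the selected weights sum to 1.
    norm = sum(probs[i] for i in expert_ids)
    weights = [probs[i] / norm for i in expert_ids]
    return expert_ids, weights


# Hypothetical router logits for one token across five experts.
logits = [0.1, 2.0, -1.0, 2.0, 0.5]
ids, weights = route_top_k(logits, 2)
print(ids, weights)
```

Run naively on a GPU, each of the three steps would typically be its own kernel launch with intermediate results written to memory. A fused kernel aims to perform them in a single launch, which is where the savings come from.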

Topics

LLM Inference Optimization
NVIDIA Blackwell Architecture
Mixture-of-Experts
Sovereign AI Models
GPU Acceleration

Code references

NVIDIA/Model-Optimizer

Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.