Now in Foundry: NVIDIA Nemotron-3-Super-120B-A12B, IBM Granite-4.0-1b-Speech, and Sarvam-105B

2026-03-30 · Source: Microsoft Foundry Blog articles · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing, Speech Recognition · Depth: Advanced, medium

Summary

Microsoft Foundry's Hugging Face collection now includes three new models: NVIDIA Nemotron-3-Super-120B-A12B, IBM Granite-4.0-1b-Speech, and Sarvam-105B. Nemotron-3-Super-120B-A12B is a 120B-parameter hybrid Latent Mixture-of-Experts (MoE) model with 12B active parameters; it supports a context window of up to 1 million tokens and offers configurable reasoning and native speculative decoding. IBM Granite-4.0-1b-Speech is a compact (~1B-parameter) model for automatic speech recognition (ASR) and automatic speech translation (AST) that achieves a 5.52% average Word Error Rate (WER) at 280× real-time speed, with runtime keyword biasing and bidirectional translation for six languages. Sarvam-105B is a 105B-parameter MoE model with 10.3B active parameters, optimized for 22 Indian languages and English, and it reports strong agentic performance on web search and task-planning benchmarks.
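
To make the two speech figures concrete, here is a minimal Python sketch of how a Word Error Rate is conventionally computed (word-level edit distance divided by reference length) and what a 280× real-time speed implies for wall-clock time. The function names and the sample sentences are illustrative assumptions, not part of IBM's evaluation setup.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,          # deletion of a reference word
                curr[j - 1] + 1,      # insertion of an extra hypothesis word
                prev[j - 1] + cost,   # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def processing_seconds(audio_seconds: float, speedup: float) -> float:
    """Wall-clock seconds needed to transcribe audio at `speedup` times real time."""
    return audio_seconds / speedup


if __name__ == "__main__":
    # Illustrative sentences only: one deleted word and one substituted word.
    wer = word_error_rate("turn on the kitchen lights", "turn on kitchen light")
    print(f"WER: {wer:.2%}")  # 2 errors over 5 reference words
    print(f"1 hour of audio at 280x: {processing_seconds(3600, 280):.1f} s")
```

On this reading, a 5.52% average WER corresponds to roughly 5 to 6 word errors per 100 reference words, and 280× real-time speed means an hour of audio is processed in about 13 seconds.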

Key takeaway

For AI architects evaluating models for specialized applications, each of these additions to Microsoft Foundry fills a distinct role. Consider NVIDIA Nemotron-3-Super-120B-A12B for ultra-long-context and agentic workflows that require configurable reasoning. IBM Granite-4.0-1b-Speech suits compact, high-speed multilingual ASR/AST with domain adaptation at runtime. Sarvam-105B combines agentic capabilities with support for 22 Indian languages and English, which matters for deployments that serve Indian-language users.

Key insights

New models in Microsoft Foundry offer specialized capabilities for long-context, multilingual speech, and agentic tasks.

Principles

Hybrid MoE architectures improve accuracy per parameter.
Runtime keyword biasing enables domain adaptation without fine-tuning.
Multi-token prediction enables native speculative decoding, which speeds up token generation.
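
Multi-token prediction is what lets a model draft several tokens ahead for the native speculative decoding mentioned in the summary. The sketch below illustrates the general draft-and-verify idea in its greedy form. It is a toy under stated assumptions: `target_next` and `draft_next` are hypothetical stand-ins for a large model and a cheap drafter, tokens are plain integers, and verification is written as a loop although real systems score all drafted positions in one forward pass. It is not NVIDIA's implementation.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_next(tokens: List[int]) -> int:
    """Hypothetical 'large model': a deterministic toy rule over integer tokens."""
    return (tokens[-1] * 3 + len(tokens)) % 11


def draft_next(tokens: List[int]) -> int:
    """Hypothetical cheap drafter: agrees with the target except at every 4th length."""
    guess = target_next(tokens)
    return guess if len(tokens) % 4 else (guess + 1) % 11


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Greedy draft-and-verify; returns the tokens and the number of verify rounds."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    rounds = 0
    while len(tokens) < goal:
        rounds += 1
        # 1) Draft k tokens cheaply.
        proposal = list(tokens)
        for _ in range(k):
            proposal.append(draft(proposal))
        # 2) Verify drafted positions left to right against the target.
        for i in range(len(tokens), len(proposal)):
            expected = target(proposal[:i])
            if proposal[i] != expected:
                # First mismatch: keep the target's token and drop the rest of the draft.
                proposal = proposal[:i] + [expected]
                break
        else:
            # Every drafted token was accepted; the target adds one more token.
            proposal.append(target(proposal))
        tokens = proposal
    return tokens[:goal], rounds


if __name__ == "__main__":
    prompt = [1, 2, 3]
    fast, rounds = speculative_decode(target_next, draft_next, prompt, n_new=12)
    assert fast == greedy_decode(target_next, prompt, n_new=12)
    print(f"12 tokens generated in {rounds} verify rounds")
```

The output is identical to plain greedy decoding with the target; the gain is that several tokens can be accepted per verification round, so the benefit shows up in generation throughput after the first token.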

Method

NVIDIA's Latent MoE architecture combines Mamba-2 state-space layers, sparse MoE layers, and full-attention layers. In the MoE layers, tokens are projected into a smaller latent space, where routing and expert computation take place.
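
The latent-routing idea can be sketched in a few lines of plain Python. Everything here is an illustrative assumption rather than NVIDIA's design: the class name `LatentMoESketch`, the dimensions, the linear experts, the random weights, and the choice to let the router read the latent vector. The Mamba-2 and attention layers are omitted entirely.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def random_matrix(rng: random.Random, rows: int, cols: int) -> Matrix:
    scale = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class LatentMoESketch:
    """Toy MoE layer whose experts operate in a reduced latent dimension."""

    def __init__(self, d_model: int = 16, d_latent: int = 4,
                 n_experts: int = 8, top_k: int = 2, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.top_k = top_k
        self.down = random_matrix(rng, d_latent, d_model)      # d_model -> d_latent
        self.up = random_matrix(rng, d_model, d_latent)        # d_latent -> d_model
        self.router = random_matrix(rng, n_experts, d_latent)  # one score per expert
        # Assumption: each expert is a single linear map; real experts are MLPs.
        self.experts = [random_matrix(rng, d_latent, d_latent) for _ in range(n_experts)]

    def forward(self, token: Vector) -> Tuple[Vector, List[int]]:
        latent = matvec(self.down, token)
        scores = matvec(self.router, latent)
        # Sparse routing: only the top_k highest-scoring experts run for this token.
        chosen = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[: self.top_k]
        # Softmax over the chosen experts' scores gives the mixing weights.
        peak = max(scores[e] for e in chosen)
        exps = [math.exp(scores[e] - peak) for e in chosen]
        total = sum(exps)
        mixed = [0.0] * len(latent)
        for e, w in zip(chosen, exps):
            out = matvec(self.experts[e], latent)
            mixed = [m + (w / total) * o for m, o in zip(mixed, out)]
        return matvec(self.up, mixed), chosen


if __name__ == "__main__":
    layer = LatentMoESketch()
    output, chosen = layer.forward([1.0] * 16)
    print(f"experts used: {chosen}, output dimension: {len(output)}")
```

With these toy sizes, each expert holds 4 × 4 = 16 weights instead of the 16 × 16 = 256 a full-width linear expert would need, and only 2 of the 8 experts run per token. That is the sense in which both the latent projection and the sparse routing reduce per-token compute.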

In practice

Use Nemotron-3-Super for 1M-token RAG and code analysis.
Employ Granite-4.0-1b-Speech for domain-specific ASR via keyword biasing.
Leverage Sarvam-105B for agentic workflows in 22 Indian languages.
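
As an illustration of what keyword biasing at runtime buys for domain-specific ASR, the sketch below rescores a recognizer's candidate transcripts with a bonus for supplied domain terms. This is a generic n-best rescoring toy, not Granite-4.0-1b-Speech's actual mechanism or API; the function name, the bonus value, and the sample hypotheses with their scores are all made up for the example.

```python
from typing import Iterable, List, Tuple

Hypothesis = Tuple[str, float]  # (transcript, log-probability from the recognizer)


def rescore_with_keywords(
    nbest: List[Hypothesis], keywords: Iterable[str], bonus: float = 2.0
) -> List[Hypothesis]:
    """Add `bonus` to a hypothesis score for each word that matches a keyword."""
    wanted = {k.lower() for k in keywords}
    rescored = []
    for text, score in nbest:
        hits = sum(1 for word in text.lower().split() if word in wanted)
        rescored.append((text, score + bonus * hits))
    # Highest score first.
    return sorted(rescored, key=lambda h: h[1], reverse=True)


# Hypothetical n-best list: the unbiased recognizer splits the drug name.
nbest = [
    ("prescribe met form in twice daily", -4.1),
    ("prescribe metformin twice daily", -4.6),
]

best_plain = max(nbest, key=lambda h: h[1])[0]
best_biased = rescore_with_keywords(nbest, ["metformin"])[0][0]

if __name__ == "__main__":
    print("without biasing:", best_plain)
    print("with biasing:   ", best_biased)
```

The keyword list is supplied at inference time, so switching domains means changing a list of terms rather than retraining or fine-tuning the model.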

Topics

NVIDIA Nemotron-3-Super-120B-A12B
IBM Granite-4.0-1b-Speech
Sarvam-105B
Mixture-of-Experts
Automatic Speech Recognition

Best for: AI Architect, AI Engineer, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Microsoft Foundry Blog articles.