Run Step 3.7 Flash on NVIDIA GPUs with Enterprise-Ready Multimodal AI

2026-05-28 · Source: NVIDIA Technical Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Robotics & Autonomous Systems · Depth: Advanced, medium

Summary

Step 3.7 Flash, the latest 198B-parameter Mixture-of-Experts vision-language model from StepFun, is now available on NVIDIA-accelerated infrastructure for production-scale multimodal AI applications. Optimized for agentic workflows, it activates approximately 11B parameters per forward pass, features native image and video input, three configurable reasoning levels, and a 256k context window. This model is designed for enterprise use cases like financial analysis and concurrent coding agents. Developers can utilize StepFun's NVFP4-quantized checkpoint via Hugging Face for boosted inference. Deployment is supported by open-source frameworks such as SGLang, NVIDIA TensorRT-LLM, and vLLM, leveraging NVIDIA hardware optimizations. NVIDIA also offers GPU-accelerated endpoints for prototyping and evaluation, including a document intelligence pipeline demo using Step 3.7 Flash and NVIDIA Nemotron Parse. Production deployment is streamlined with NVIDIA NIM microservices, and Day 0 fine-tuning is possible using the NVIDIA NeMo framework, supporting SFT and LoRA at 600 tokens/sec on Hopper GPUs.

Key takeaway

For AI Engineers building multimodal enterprise applications, you should consider Step 3.7 Flash for its production-ready capabilities on NVIDIA GPUs. Its 256k context window and configurable reasoning levels support complex agentic workflows. Utilize NVIDIA NIM for streamlined deployment and NeMo for Day 0 fine-tuning to adapt the model to your specific domain data. Explore the document intelligence pipeline demo to see its practical application in extracting structured insights from complex documents.

Key insights

Step 3.7 Flash is an enterprise-ready multimodal MoE model optimized for NVIDIA infrastructure.

Principles

Multimodal AI enables real-time perception and reasoning.
MoE models balance performance with efficiency.
Quantization boosts inference performance.

Method

Deploy Step 3.7 Flash using NVIDIA NIM containers, start a server with an OpenAI client, and send text or image input to the endpoint for inference. Fine-tune with NeMo Automodel.

In practice

Use NVFP4-quantized checkpoint for memory-efficient inference.
Integrate with NVIDIA Nemotron Parse for document intelligence.
Fine-tune with NeMo for domain-specific customization.

Topics

Step 3.7 Flash
Vision-Language Models
NVIDIA GPUs
Document Intelligence
NVIDIA NIM
NeMo Framework

Code references

Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer

Related on AIssential

See Counsel's argued verdicts on the open AI decisions leaders are weighing →

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.