Run Step 3.7 Flash on NVIDIA GPUs with Enterprise-Ready Multimodal AI

· Source: NVIDIA Technical Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Robotics & Autonomous Systems · Depth: Advanced, medium

Summary

Step 3.7 Flash, the latest 198B-parameter Mixture-of-Experts vision-language model from StepFun, is now available on NVIDIA-accelerated infrastructure for production-scale multimodal AI applications. Optimized for agentic workflows, it activates approximately 11B parameters per forward pass, features native image and video input, three configurable reasoning levels, and a 256k context window. This model is designed for enterprise use cases like financial analysis and concurrent coding agents. Developers can utilize StepFun's NVFP4-quantized checkpoint via Hugging Face for boosted inference. Deployment is supported by open-source frameworks such as SGLang, NVIDIA TensorRT-LLM, and vLLM, leveraging NVIDIA hardware optimizations. NVIDIA also offers GPU-accelerated endpoints for prototyping and evaluation, including a document intelligence pipeline demo using Step 3.7 Flash and NVIDIA Nemotron Parse. Production deployment is streamlined with NVIDIA NIM microservices, and Day 0 fine-tuning is possible using the NVIDIA NeMo framework, supporting SFT and LoRA at 600 tokens/sec on Hopper GPUs.

Key takeaway

For AI Engineers building multimodal enterprise applications, you should consider Step 3.7 Flash for its production-ready capabilities on NVIDIA GPUs. Its 256k context window and configurable reasoning levels support complex agentic workflows. Utilize NVIDIA NIM for streamlined deployment and NeMo for Day 0 fine-tuning to adapt the model to your specific domain data. Explore the document intelligence pipeline demo to see its practical application in extracting structured insights from complex documents.

Key insights

Step 3.7 Flash is an enterprise-ready multimodal MoE model optimized for NVIDIA infrastructure.

Principles

Method

Deploy Step 3.7 Flash using NVIDIA NIM containers, start a server with an OpenAI client, and send text or image input to the endpoint for inference. Fine-tune with NeMo Automodel.

In practice

Topics

Code references

Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.