Develop Native Multimodal Agents with Qwen3.5 VLM Using NVIDIA GPU-Accelerated Endpoints

· Source: NVIDIA Technical Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Robotics & Autonomous Systems · Depth: Intermediate, quick

Summary

Alibaba has released the Qwen3.5 series, an open-source vision-language model (VLM) designed for native multimodal agents. The initial model in this series is a ~400B parameter VLM that integrates a hybrid architecture combining Mixture of Experts (MoE) and Gated Delta Networks. Qwen3.5 demonstrates enhanced capabilities in understanding and navigating user interfaces, surpassing previous VLM generations. NVIDIA provides free access to GPU-accelerated endpoints for Qwen3.5 on build.nvidia.com, powered by NVIDIA Blackwell GPUs, allowing developers to experiment with prompts and test the model. The model supports various applications, including coding, visual reasoning for mobile and web interfaces, chat, and complex search. NVIDIA NIM offers containerized inference microservices for production deployment, and the NVIDIA NeMo framework facilitates fine-tuning for specialized domain needs.

Key takeaway

For AI Engineers evaluating multimodal models for agentic workflows, Qwen3.5 offers a robust open-source option with strong UI understanding. You should explore its capabilities on NVIDIA's free build.nvidia.com endpoints and consider using NVIDIA NeMo for domain-specific fine-tuning to optimize performance for your specialized applications.

Key insights

Qwen3.5 is a ~400B parameter open-source VLM with a hybrid MoE and Gated Delta Network architecture, excelling in UI navigation.

Principles

Method

Developers can fine-tune Qwen3.5's 397B-parameter architecture using the PyTorch-native NeMo Automodel library, supporting full supervised fine-tuning or memory-efficient methods like LoRA.

In practice

Topics

Code references

Best for: AI Engineer, Machine Learning Engineer, Data Scientist

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.