Develop Native Multimodal Agents with Qwen3.5 VLM Using NVIDIA GPU-Accelerated Endpoints
Summary
Alibaba has released the Qwen3.5 series, an open-source vision-language model (VLM) designed for native multimodal agents. The initial model in this series is a ~400B parameter VLM that integrates a hybrid architecture combining Mixture of Experts (MoE) and Gated Delta Networks. Qwen3.5 demonstrates enhanced capabilities in understanding and navigating user interfaces, surpassing previous VLM generations. NVIDIA provides free access to GPU-accelerated endpoints for Qwen3.5 on build.nvidia.com, powered by NVIDIA Blackwell GPUs, allowing developers to experiment with prompts and test the model. The model supports various applications, including coding, visual reasoning for mobile and web interfaces, chat, and complex search. NVIDIA NIM offers containerized inference microservices for production deployment, and the NVIDIA NeMo framework facilitates fine-tuning for specialized domain needs.
Key takeaway
For AI Engineers evaluating multimodal models for agentic workflows, Qwen3.5 offers a robust open-source option with strong UI understanding. You should explore its capabilities on NVIDIA's free build.nvidia.com endpoints and consider using NVIDIA NeMo for domain-specific fine-tuning to optimize performance for your specialized applications.
Key insights
Qwen3.5 is a ~400B parameter open-source VLM with a hybrid MoE and Gated Delta Network architecture, excelling in UI navigation.
Principles
- Hybrid architectures enhance VLM capabilities.
- Fine-tuning adapts models for specialized domains.
Method
Developers can fine-tune Qwen3.5's 397B-parameter architecture using the PyTorch-native NeMo Automodel library, supporting full supervised fine-tuning or memory-efficient methods like LoRA.
In practice
- Test Qwen3.5 on NVIDIA's free GPU-accelerated endpoints.
- Use NVIDIA NIM for containerized production deployment.
- Fine-tune Qwen3.5 for medical visual QA with NeMo.
Topics
- Qwen3.5
- Vision-Language Models
- Mixture-of-Experts
- NVIDIA NeMo
- Multimodal Agents
Code references
Best for: AI Engineer, Machine Learning Engineer, Data Scientist
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.