I Gave Qwen3.7-Plus a Screenshot and It Found the Exact Pixel to Click for $0.40

2026-06-08 · Source: Towards AI - Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Intermediate, quick

Summary

Qwen3.7-Plus is a frontier-tier screen-grounding model capable of identifying exact pixel coordinates on a screenshot based on natural language instructions. For instance, it precisely located the "Launch instance" button at (x=1147, y=283) on an AWS console screenshot. This model is priced at \$0.40 per million input tokens, which is one-sixth the cost of Alibaba's text-only Qwen3.7-Max. It achieves a score of 79.0 on the ScreenSpot Pro benchmark, indicating its effectiveness for "computer use" agents. The model integrates easily, callable via the standard OpenAI SDK with just a four-line code modification, making advanced GUI grounding accessible for various applications including design mockups and live browser interactions.

Key takeaway

For AI Engineers developing "computer use" agents or automating GUI interactions, Qwen3.7-Plus offers a compelling solution. Its precise pixel-level grounding, demonstrated by a 79.0 ScreenSpot Pro score, combined with a \$0.40 per million token price, significantly lowers the barrier to entry for advanced agent capabilities. You should consider integrating this model via the OpenAI SDK to enhance your agents' ability to navigate and interact with complex graphical interfaces efficiently and affordably.

Key insights

Qwen3.7-Plus provides precise and affordable GUI grounding, enabling advanced "computer use" agents via a simple API.

Principles

GUI grounding is fundamental for "computer use" agents.
High accuracy and low cost can democratize advanced AI agent capabilities.
Standard SDK integration simplifies adoption.

Method

Provide a screenshot and natural language instruction to the model via the OpenAI SDK; it returns precise pixel coordinates for the target UI element.

In practice

Use Qwen3.7-Plus for automating UI interactions.
Integrate with existing OpenAI SDK workflows.
Test on design mockups or live browser environments.

Topics

GUI Grounding
Qwen3.7-Plus
AI Agents
OpenAI SDK
UI Automation
Large Multimodal Models

Best for: AI Engineer, Machine Learning Engineer, AI Scientist

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards AI - Medium.