I Gave Qwen3.7-Plus a Screenshot and It Found the Exact Pixel to Click for $0.40
Summary
Qwen3.7-Plus is a frontier-tier screen-grounding model capable of identifying exact pixel coordinates on a screenshot based on natural language instructions. For instance, it precisely located the "Launch instance" button at (x=1147, y=283) on an AWS console screenshot. This model is priced at \$0.40 per million input tokens, which is one-sixth the cost of Alibaba's text-only Qwen3.7-Max. It achieves a score of 79.0 on the ScreenSpot Pro benchmark, indicating its effectiveness for "computer use" agents. The model integrates easily, callable via the standard OpenAI SDK with just a four-line code modification, making advanced GUI grounding accessible for various applications including design mockups and live browser interactions.
Key takeaway
For AI Engineers developing "computer use" agents or automating GUI interactions, Qwen3.7-Plus offers a compelling solution. Its precise pixel-level grounding, demonstrated by a 79.0 ScreenSpot Pro score, combined with a \$0.40 per million token price, significantly lowers the barrier to entry for advanced agent capabilities. You should consider integrating this model via the OpenAI SDK to enhance your agents' ability to navigate and interact with complex graphical interfaces efficiently and affordably.
Key insights
Qwen3.7-Plus provides precise and affordable GUI grounding, enabling advanced "computer use" agents via a simple API.
Principles
- GUI grounding is fundamental for "computer use" agents.
- High accuracy and low cost can democratize advanced AI agent capabilities.
- Standard SDK integration simplifies adoption.
Method
Provide a screenshot and natural language instruction to the model via the OpenAI SDK; it returns precise pixel coordinates for the target UI element.
In practice
- Use Qwen3.7-Plus for automating UI interactions.
- Integrate with existing OpenAI SDK workflows.
- Test on design mockups or live browser environments.
Topics
- GUI Grounding
- Qwen3.7-Plus
- AI Agents
- OpenAI SDK
- UI Automation
- Large Multimodal Models
Best for: AI Engineer, Machine Learning Engineer, AI Scientist
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards AI - Medium.