Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7

· Source: Simon Willison's Weblog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Intermediate, quick

Summary

On April 16, 2026, a comparison of image generation capabilities between Alibaba's Qwen3.6-35B-A3B and Anthropic's Claude Opus 4.7 revealed unexpected results. Using a 20.9GB quantized version of Qwen3.6-35B-A3B-UD-Q4_K_S.gguf running locally on a MacBook Pro M5 via LM Studio, the Qwen model produced a superior image of a "pelican riding a bicycle" compared to Claude Opus 4.7, which struggled with the bicycle's frame. A subsequent test involving a "flamingo riding a unicycle" also favored Qwen, which generated a more charismatic and detailed SVG. Despite the "pelican benchmark" being a humorous, informal test, its historical correlation with general model utility appears to be breaking, as a smaller, locally run model outperformed a major proprietary release.

Key takeaway

For AI engineers evaluating image generation models, do not solely rely on general utility benchmarks. Your teams should consider testing smaller, quantized models like Qwen3.6-35B-A3B for specific creative outputs, especially when local inference is a priority. This approach might yield superior results for niche tasks compared to larger, proprietary models, challenging assumptions about model hierarchy.

Key insights

Local, quantized models can surprisingly outperform larger proprietary models in specific creative tasks.

Principles

In practice

Topics

Code references

Best for: AI Engineer, Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Prompt Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Simon Willison's Weblog.