The last six months in LLMs in five minutes

2026-05-19 · Source: Simon Willison's Weblog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Emerging Technologies & Innovation · Depth: Intermediate, medium

Summary

The last six months in Large Language Models (LLMs), from November 2025 to April 2026, marked a significant inflection point, particularly for coding applications. November 2025 saw the "best" model title shift five times among Anthropic, OpenAI, and Google, with models like Claude Sonnet 4.5, GPT-5.1, Gemini 3, GPT-5.1 Codex Max, and Claude Opus 4.5. Crucially, coding agents, enhanced by Reinforcement Learning from Verifiable Rewards, transitioned from "often-work" to "mostly-work," becoming reliable daily tools. February 2026 saw the "Warelay" project evolve into "OpenClaw," a personal AI assistant, popularizing the generic term "Claws" and driving Mac Mini sales. February also brought Gemini 3.1 Pro, demonstrating remarkable image generation capabilities. April 2026 introduced Google's Gemma 4 series and GLM-5.1, a 1.5TB open-weight model, alongside Qwen3.6-35B-A3B, a 20.9GB open-weight model that outperformed Claude Opus 4.7 on specific benchmarks. The period's key themes are the dramatic improvement of coding agents and the unexpected performance of local, laptop-run models.

Key takeaway

For AI Engineers evaluating LLM integration, the rapid evolution from November 2025 to April 2026 makes continuous re-evaluation of model choices critical. Advanced coding agents can now be treated as reliable daily drivers that significantly boost productivity. Capable open-weight models such as Gemma 4 or Qwen3.6-35B-A3B are also worth exploring for local deployment: their performance now far exceeds prior expectations, which may reduce reliance on costly cloud APIs.

Key insights

LLM capabilities, particularly coding agents and local models, advanced rapidly from November 2025 to April 2026, marked by frequent "best" model shifts.

Principles

LLM performance leadership is highly volatile, changing hands frequently.
Specialized training like RL from Verifiable Rewards enhances agent quality.
Unconventional benchmarks can reveal how well a model generalizes beyond standard test suites.

Method

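Reinforcement Learning from Verifiable Rewards (RLVR) was applied to increase code quality in LLM agents. In RLVR the reward comes from an automatic check, such as whether generated code passes its tests, rather than from a human rating.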
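
To make the idea of a verifiable reward concrete, here is a minimal sketch, not the training setup of any particular lab. It assumes the reward is binary and that "verifiable" means "the candidate code passes a set of assertions when run in a separate Python process". The function name `verifiable_reward` and the sample `add` task are hypothetical, chosen only for illustration.

```python
import subprocess
import sys


def verifiable_reward(candidate_source: str, test_source: str, timeout_s: float = 5.0) -> float:
    """Return 1.0 if the candidate passes every assertion in test_source, else 0.0."""
    # Run candidate and tests together in a child interpreter so a crash,
    # syntax error, or failed assertion shows up as a non-zero exit code.
    script = candidate_source + "\n\n" + test_source
    try:
        result = subprocess.run(
            [sys.executable, "-c", script],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # Code that hangs earns no reward.
        return 0.0
    return 1.0 if result.returncode == 0 else 0.0


# Hypothetical task: implement add(a, b).
TESTS = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
GOOD = "def add(a, b):\n    return a + b"
BAD = "def add(a, b):\n    return a - b"

if __name__ == "__main__":
    print(verifiable_reward(GOOD, TESTS))
    print(verifiable_reward(BAD, TESTS))
```

A child process with a timeout is only a first layer of isolation. Anything that runs model-generated code at scale would presumably need a proper sandbox, which this sketch leaves out.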

In practice

Integrate advanced coding agents into daily development workflows.
Evaluate open-weight models like Gemma 4 or Qwen3.6-35B-A3B for local deployment.
Consider dedicated hardware like Mac Minis for running personal AI assistants.
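
Because the leading model changes so often, it can help to keep a small task set of your own and rerun it whenever a new model appears. The sketch below shows one way to score a local and a hosted model on the same tasks. It is an assumption-laden illustration: `pass_rate`, `rank_models`, and the two stub models are hypothetical, and the stubs stand in for real calls to a local runtime or a cloud API.

```python
from typing import Callable, Dict, List, Tuple

# A task pairs a prompt with a checker that decides whether an answer is acceptable.
Task = Tuple[str, Callable[[str], bool]]


def pass_rate(generate: Callable[[str], str], tasks: List[Task]) -> float:
    """Fraction of tasks whose checker accepts the model's answer."""
    if not tasks:
        return 0.0
    passed = sum(1 for prompt, check in tasks if check(generate(prompt)))
    return passed / len(tasks)


def rank_models(
    models: Dict[str, Callable[[str], str]], tasks: List[Task]
) -> List[Tuple[str, float]]:
    """Score every model on the same tasks and return results best first."""
    scores = [(name, pass_rate(fn, tasks)) for name, fn in models.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical stand-ins: replace with calls to a local runtime or a hosted API.
def stub_local_model(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "unknown"


def stub_hosted_model(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "Paris"


TASKS: List[Task] = [
    ("What is 2 + 2?", lambda answer: answer.strip() == "4"),
    ("What is the capital of France?", lambda answer: "Paris" in answer),
]

if __name__ == "__main__":
    models = {"local": stub_local_model, "hosted": stub_hosted_model}
    for name, score in rank_models(models, TASKS):
        print(f"{name}: {score:.0%}")
```

Keeping each model behind a plain `Callable[[str], str]` means swapping in a newly released model only requires a new entry in the dictionary, with the tasks and scoring left unchanged.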

Topics

Large Language Models
Coding Agents
Open-weight Models
Model Benchmarking
Personal AI Assistants
Reinforcement Learning from Verifiable Rewards

Best for: NLP Engineer, CTO, VP of Engineering/Data, AI Engineer, Machine Learning Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by Simon Willison's Weblog.