Latest open artifacts (#21): Open model bonanza! Gemma 4, DeepSeek V4, Kimi K2.6, MiMo 2.5, GLM-5.1 & others. On CAISI's V4 assessment.

2026-05-16 · Source: Interconnects AI · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Emerging Technologies & Innovation · Depth: Advanced, short

Summary

The Center for AI Standards and Innovation (CAISI) evaluated new open frontier models, including DeepSeek V4 Pro, and concluded that open models generally lag behind American frontier models and that the gap is widening. To assess capabilities, CAISI computed an Elo score based on Item Response Theory (IRT) across nine benchmarks, including CTF-Archive-Diamond, PortBench, and ARC-AGI-2. Where CAISI's report indicates a significant gap, Epoch AI's ECI, which also uses IRT, suggests the gap has stayed consistently between 3 and 7 months. Both evaluations are criticized for relying on simplified setups, such as fixed token budgets for coding tasks, that may not accurately reflect real-world model performance. The article also highlights recent open model releases, including XiaomiMiMo's MiMo-V2.5-Pro, Google's Gemma-4 series (now Apache 2.0 licensed), moonshotai's Kimi-K2.6, poolside's Laguna-XS.2, and deepseek-ai's DeepSeek-V4-Flash.

Key takeaway

AI architects and engineering leaders evaluating open-source large language models should recognize that benchmark evaluations such as CAISI's Elo score or Epoch AI's ECI may underrepresent true capabilities because of their simplified testing environments. Before making deployment decisions, prioritize real-world performance testing with model-specific harnesses and prompting to assess an open model's fit for complex tasks, especially long-horizon or agentic applications.

Key insights

Open models are progressing, but the size of their performance gap to closed models is disputed because of limitations in evaluation methodology.

Principles

Benchmark design significantly impacts perceived model capabilities.
Real-world performance can exceed standardized benchmark scores.

Method

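CAISI derives an Elo score from Item Response Theory (IRT) fitted across nine benchmarks, and Epoch AI's ECI also uses IRT. Because IRT estimates model ability and item difficulty jointly, it can place models on a common scale even when they were evaluated on differing test sets.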
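
To make the IRT idea concrete, here is a minimal sketch of a one-parameter (Rasch) model fitted by gradient ascent on toy pass/fail data in which no two models saw exactly the same items. This is an illustration only, not CAISI's or Epoch AI's actual procedure: the model names, item names, results, regularization strength, and the 1000-point anchor are all invented for the example. The only fixed fact used is the standard Elo convention that a 400-point gap corresponds to 10:1 odds, so one logit equals 400 / ln(10) Elo points.

```python
import math

def fit_rasch(results, n_iters=2000, lr=0.05, l2=0.01):
    """Fit a Rasch (1PL) IRT model: P(pass) = sigmoid(ability - difficulty).

    results maps (model, item) -> 1 (pass) or 0 (fail). Pairs that were never
    evaluated are simply absent, so models may be tested on different items.
    A small L2 penalty keeps the estimates finite on tiny data sets.
    """
    models = sorted({m for m, _ in results})
    items = sorted({i for _, i in results})
    ability = {m: 0.0 for m in models}
    difficulty = {i: 0.0 for i in items}
    for _ in range(n_iters):
        # Gradient of the penalized log-likelihood.
        g_ability = {m: -l2 * ability[m] for m in models}
        g_difficulty = {i: -l2 * difficulty[i] for i in items}
        for (m, i), y in results.items():
            p = 1.0 / (1.0 + math.exp(-(ability[m] - difficulty[i])))
            g_ability[m] += y - p
            g_difficulty[i] -= y - p
        for m in models:
            ability[m] += lr * g_ability[m]
        for i in items:
            difficulty[i] += lr * g_difficulty[i]
    return ability, difficulty

def logits_to_elo(ability, anchor=1000.0):
    """Rescale logit abilities to an Elo-like scale centered on `anchor`.

    The anchor value is an arbitrary assumption; only differences matter.
    """
    scale = 400.0 / math.log(10)
    mean = sum(ability.values()) / len(ability)
    return {m: anchor + scale * (t - mean) for m, t in ability.items()}

# Hypothetical toy data: each model was run on a different subset of items.
toy_results = {
    ("strong", "easy1"): 1, ("strong", "hard1"): 1,
    ("strong", "hard2"): 1, ("strong", "hard3"): 0,
    ("mid", "easy1"): 1, ("mid", "easy2"): 1,
    ("mid", "hard1"): 0, ("mid", "hard2"): 1,
    ("weak", "easy1"): 1, ("weak", "easy2"): 0,
    ("weak", "hard1"): 0, ("weak", "hard3"): 0,
}

theta, item_difficulty = fit_rasch(toy_results)
elo = logits_to_elo(theta)
for name in sorted(elo, key=elo.get, reverse=True):
    print(f"{name}: {elo[name]:.0f}")
```

In this toy data, "strong" and "mid" both pass three of four items, but "strong" was given the harder subset, so the fit credits it with a higher ability. Raw pass rates cannot make that distinction, which is why IRT-style scoring is useful when test sets differ.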

In practice

Consider model-specific prompting for accurate capability elicitation.
Evaluate coding models with realistic harnesses, not just simple setups.
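
One way to act on these two points is to score the same task set under more than one evaluation configuration and compare the results, rather than trusting a single fixed setup. The sketch below is a hypothetical skeleton: `EvalConfig`, `pass_rate`, `compare_configs`, and the toy model (which "truncates" its answer when the token budget is too small) are all invented for illustration, and a real harness would call an actual model and run real task checks instead.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass(frozen=True)
class EvalConfig:
    """One evaluation setup: a prompt style plus a token budget (hypothetical)."""
    name: str
    system_prompt: str
    max_tokens: int

# A model is any callable: (system_prompt, task_prompt, max_tokens) -> answer.
ModelFn = Callable[[str, str, int], str]
Task = Dict[str, str]  # {"prompt": ..., "expected": ...}

def pass_rate(model: ModelFn, config: EvalConfig, tasks: List[Task]) -> float:
    """Fraction of tasks whose answer exactly matches the expected string."""
    passed = sum(
        model(config.system_prompt, t["prompt"], config.max_tokens).strip() == t["expected"]
        for t in tasks
    )
    return passed / len(tasks)

def compare_configs(model: ModelFn, configs: List[EvalConfig], tasks: List[Task]) -> Dict[str, float]:
    """Score the same model and tasks under every configuration."""
    return {c.name: pass_rate(model, c, tasks) for c in configs}

def toy_model(system_prompt: str, prompt: str, max_tokens: int) -> str:
    """Stand-in model: needs 4 tokens per input character, else output is cut off."""
    if max_tokens < 4 * len(prompt):
        return ""  # simulates a truncated, failing answer
    return prompt.upper()

tasks = [
    {"prompt": "ab", "expected": "AB"},
    {"prompt": "abcdefgh", "expected": "ABCDEFGH"},
    {"prompt": "abcdefghijklmnop", "expected": "ABCDEFGHIJKLMNOP"},
]
configs = [
    EvalConfig("fixed_budget", "You are a helpful assistant.", max_tokens=16),
    EvalConfig("model_specific", "Prompt recommended by the model's authors.", max_tokens=64),
]

scores = compare_configs(toy_model, configs, tasks)
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Here the toy model fails the two longer tasks only because the fixed budget cuts it off, so the gap between the two scores reflects the evaluation setup and not the model's capability.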

Topics

Open Language Models
AI Model Evaluation
CAISI Assessment
Item Response Theory
DeepSeek V4

Best for: CTO, VP of Engineering/Data, AI Architect, AI Scientist, Machine Learning Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by Interconnects AI.