The Structured Output Benchmark (SOB) - validates both JSON parse and value accuracy [R]

2026-04-28 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, quick

Summary

The Structured Output Benchmark (SOB) is a new evaluation framework for assessing how accurately AI models generate structured data, specifically JSON. Existing benchmarks primarily validate JSON schema and type compliance; SOB instead checks whether the leaf-level values inside the JSON output are correct. It reports seven metrics: Value Accuracy, JSON Pass Rate, Type Safety, Path Recall, Structure Coverage, Faithfulness (grounding in context), and Perfect Response. Initial results indicate that open-source models are competitive: GLM 4.7 ranks second, behind only GPT 5.4. JSON schema pass rates are often 90% or higher, yet value accuracy lags well behind them, which shows that models frequently produce syntactically correct but factually inaccurate structured data. The benchmark also breaks results down by input modality: text, image, and audio.

Key takeaway

AI Architects and NLP Engineers evaluating models for deterministic structured output tasks should recognize that a high JSON schema pass rate does not guarantee accurate data. Favor benchmarks like SOB that emphasize value accuracy and faithfulness, so that you avoid deploying models that hallucinate critical data points. Selection criteria should weigh heavily a model's ability to produce factually correct, contextually grounded values, especially for applications involving financial data or ordered arrays.

Key insights

The Structured Output Benchmark (SOB) measures JSON value accuracy, revealing a significant gap between schema compliance and factual correctness.

Principles

Value accuracy is critical for structured output.
Faithfulness ensures values are contextually grounded.
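
To make "contextually grounded" concrete, here is a minimal sketch of a grounding check. It is not SOB's Faithfulness metric, whose exact definition is not given in this summary; the function name ungrounded_values and the verbatim-substring rule are assumptions chosen for illustration, and a substring match is only a crude proxy for grounding.

```python
def ungrounded_values(output, context):
    """Return leaf paths whose values do not appear verbatim in the context.

    Hypothetical helper: a case-insensitive substring match is a crude
    stand-in for a real faithfulness check.
    """
    text = context.lower()
    missing = []

    def walk(node, path):
        if isinstance(node, dict):
            for key, val in node.items():
                walk(val, f"{path}.{key}" if path else key)
        elif isinstance(node, list):
            for i, val in enumerate(node):
                walk(val, f"{path}[{i}]")
        elif node is not None and str(node).lower() not in text:
            # Leaf value cannot be found in the source text.
            missing.append(path)

    walk(output, "")
    return missing


# Made-up example: the total was transposed from 120.50 to 102.5.
context = "Invoice 884 from Acme Corp, total due 120.50 USD."
output = {"vendor": "Acme Corp", "total": 102.5, "currency": "USD"}
flagged = ungrounded_values(output, context)
print(flagged)
```

The output above is valid JSON with the right types, yet the check flags the total field because its value never occurs in the context.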

Method

SOB measures seven metrics: Value Accuracy, JSON Pass Rate, Type Safety, Path Recall, Structure Coverage, Faithfulness, and Perfect Response across text, image, and audio modalities.
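
The difference between a JSON pass and value accuracy can be sketched in a few lines. The scoring rules below (exact match on flattened leaf paths, with a strict type check) are assumptions for illustration, not SOB's published definitions, and the names flatten and score are hypothetical.

```python
import json


def flatten(obj, prefix=""):
    """Map each leaf path (e.g. 'items[0]') to its value."""
    leaves = {}
    if isinstance(obj, dict):
        for key, val in obj.items():
            leaves.update(flatten(val, f"{prefix}.{key}" if prefix else key))
    elif isinstance(obj, list):
        for i, val in enumerate(obj):
            leaves.update(flatten(val, f"{prefix}[{i}]"))
    else:
        leaves[prefix] = obj
    return leaves


def score(raw_output, expected):
    """Score one model response against the expected JSON object."""
    try:
        predicted = json.loads(raw_output)
    except json.JSONDecodeError:
        return {"json_pass": False, "path_recall": 0.0,
                "value_accuracy": 0.0, "perfect": False}
    gold, pred = flatten(expected), flatten(predicted)
    present = [p for p in gold if p in pred]
    # Require the same type so that, for example, True does not match 1.
    correct = [p for p in present
               if type(pred[p]) is type(gold[p]) and pred[p] == gold[p]]
    n = len(gold) or 1  # avoid dividing by zero on an empty expectation
    return {
        "json_pass": True,
        "path_recall": len(present) / n,
        "value_accuracy": len(correct) / n,
        "perfect": len(correct) == len(gold) == len(pred),
    }


# Made-up example: the response parses and has every path, but one value is wrong.
expected = {"invoice": {"total": 120.5, "currency": "USD"}, "items": ["a", "b"]}
raw = '{"invoice": {"total": 102.5, "currency": "USD"}, "items": ["a", "b"]}'
result = score(raw, expected)
print(result)
```

Under these assumptions the response passes parsing and recovers every path, yet only three of its four leaf values are correct, so it is not a perfect response.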

In practice

Prioritize value accuracy over JSON schema pass rate.
Evaluate model performance across diverse modalities.
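
One way to act on both points is to report the gap between pass rate and value accuracy separately for each modality. The sketch below assumes per-sample records with hypothetical field names (modality, json_pass, value_accuracy); the records are made-up toy values, not SOB results.

```python
from collections import defaultdict


def summarize_by_modality(records):
    """Average pass rate and value accuracy per modality, plus their gap."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec["modality"]].append(rec)

    summary = {}
    for modality, recs in buckets.items():
        n = len(recs)
        pass_rate = sum(r["json_pass"] for r in recs) / n
        value_acc = sum(r["value_accuracy"] for r in recs) / n
        summary[modality] = {
            "json_pass_rate": pass_rate,
            "value_accuracy": value_acc,
            "gap": pass_rate - value_acc,  # large gap: valid JSON, wrong values
        }
    return summary


# Toy records for illustration only.
records = [
    {"modality": "text", "json_pass": True, "value_accuracy": 1.0},
    {"modality": "text", "json_pass": True, "value_accuracy": 0.5},
    {"modality": "audio", "json_pass": True, "value_accuracy": 0.25},
    {"modality": "audio", "json_pass": False, "value_accuracy": 0.0},
]
summary = summarize_by_modality(records)
for modality, stats in summary.items():
    print(modality, stats)
```

Ranking candidate models by value accuracy, and inspecting the gap per modality, keeps a high pass rate from masking weak performance on one input type.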

Topics

Structured Output Benchmark
Value Accuracy
JSON Schema Validation
Large Language Models
Multimodal AI

Best for: Research Scientist, AI Architect, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.