The Structured Output Benchmark (SOB) - validates both JSON parse and value accuracy [R]
Summary
The Structured Output Benchmark (SOB) is a new evaluation framework for assessing how accurately AI models generate structured data, specifically JSON. Unlike existing benchmarks, which primarily validate JSON schema and type compliance, SOB focuses on the accuracy of leaf-level values within the JSON output. It measures seven metrics: Value Accuracy, JSON Pass Rate, Type Safety, Path Recall, Structure Coverage, Faithfulness (grounding in context), and Perfect Response. Initial results indicate that open-source models are competitive: GLM 4.7 ranks second, behind only GPT 5.4. A significant gap exists between JSON schema pass rates (often 90%+) and value accuracy, which shows that models frequently produce syntactically correct but factually inaccurate structured data. The benchmark also provides breakdowns by input modality: text, image, and audio.
Key takeaway
For AI Architects and NLP Engineers evaluating models for deterministic structured output tasks, recognize that high JSON schema pass rates do not guarantee accurate data. Focus on benchmarks like SOB that emphasize value accuracy and faithfulness to avoid deploying models that hallucinate critical data points. Your selection criteria should heavily weigh a model's ability to produce factually correct, contextually grounded values, especially for applications involving financial data or ordered arrays.
Key insights
The Structured Output Benchmark (SOB) measures JSON value accuracy, revealing a significant gap between schema compliance and factual correctness.
Principles
- Value accuracy is critical for structured output.
- Faithfulness ensures values are contextually grounded.
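The grounding principle can be illustrated with a rough proxy check. This is only a sketch and not SOB's actual Faithfulness metric, whose definition is not given here: it assumes a value counts as grounded when its string form appears verbatim in the source context. The names `iter_leaves` and `ungrounded_leaves`, and the sample invoice text, are hypothetical.

```python
from typing import Any, Iterator, List, Tuple


def iter_leaves(node: Any, path: str = "$") -> Iterator[Tuple[str, Any]]:
    """Yield (path, value) for every leaf of a parsed JSON document."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from iter_leaves(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_leaves(value, f"{path}[{index}]")
    else:
        yield path, node


def ungrounded_leaves(output: Any, context: str) -> List[str]:
    """Return paths whose values do not appear verbatim in the context.

    Assumption: case-insensitive substring match is a stand-in for grounding.
    A real metric would need to handle paraphrase, units, and number formats.
    """
    haystack = context.lower()
    return [
        path
        for path, value in iter_leaves(output)
        if value is not None and str(value).lower() not in haystack
    ]


# Made-up example: the currency is not supported by the context.
context = "Invoice 881 from Acme Corp, total due 120.50 USD."
output = {"vendor": "Acme Corp", "total": "120.50", "currency": "EUR"}
flagged = ungrounded_leaves(output, context)
```

Here `flagged` contains only the `currency` path, even though the output is well-formed JSON with plausible types.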
Method
SOB measures seven metrics: Value Accuracy, JSON Pass Rate, Type Safety, Path Recall, Structure Coverage, Faithfulness, and Perfect Response across text, image, and audio modalities.
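To make the difference between parse-level and value-level scoring concrete, here is a minimal sketch of three of the metrics. The formulas are assumptions for illustration, not SOB's published definitions: JSON pass means the output parses, path recall is the share of expected leaf paths present in the output, and value accuracy is the share of expected leaves whose value matches exactly. The function names and the sample record are hypothetical.

```python
import json
from typing import Any, Dict


def flatten(node: Any, prefix: str = "") -> Dict[str, Any]:
    """Map each leaf path (e.g. 'items[0]') to its value."""
    leaves: Dict[str, Any] = {}
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{prefix}.{key}" if prefix else key
            leaves.update(flatten(value, child))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            leaves.update(flatten(value, f"{prefix}[{index}]"))
    else:
        leaves[prefix] = node
    return leaves


def score(raw_output: str, expected: Any) -> Dict[str, float]:
    """Score one model response against the expected JSON (assumed formulas)."""
    try:
        predicted = json.loads(raw_output)
    except json.JSONDecodeError:
        return {"json_pass": 0.0, "path_recall": 0.0, "value_accuracy": 0.0}

    gold = flatten(expected)
    pred = flatten(predicted)
    present = [path for path in gold if path in pred]
    # Strict match: same type and same value, so "120.5" != 120.5.
    correct = [
        path for path in present
        if type(pred[path]) is type(gold[path]) and pred[path] == gold[path]
    ]
    total = len(gold) or 1  # avoid division by zero on empty targets
    return {
        "json_pass": 1.0,
        "path_recall": len(present) / total,
        "value_accuracy": len(correct) / total,
    }


# Made-up example: valid JSON, every path present, but a transposed
# total and a reordered array leave only one of four leaves correct.
expected = {"invoice": {"total": 120.5, "currency": "USD"}, "items": ["a", "b"]}
raw = '{"invoice": {"total": 102.5, "currency": "USD"}, "items": ["b", "a"]}'
result = score(raw, expected)
```

In this example the response passes parsing and has full path recall, yet value accuracy is 0.25, which is the kind of gap the benchmark is designed to expose.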
In practice
- Prioritize value accuracy over JSON schema pass rate.
- Evaluate model performance across diverse modalities.
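One way to act on both points when comparing candidates is to aggregate per-sample scores by model and modality, then rank on value accuracy rather than pass rate. The sketch below assumes per-sample records with hypothetical field names (`model`, `modality`, `json_pass`, `value_accuracy`); the records are placeholders, not benchmark results.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (model, modality)


def summarize(records: List[dict]) -> Dict[Key, Dict[str, float]]:
    """Average scores per (model, modality) and report the pass/accuracy gap."""
    grouped: Dict[Key, List[dict]] = defaultdict(list)
    for record in records:
        grouped[(record["model"], record["modality"])].append(record)

    summary: Dict[Key, Dict[str, float]] = {}
    for key, rows in grouped.items():
        pass_rate = mean(row["json_pass"] for row in rows)
        value_acc = mean(row["value_accuracy"] for row in rows)
        summary[key] = {
            "json_pass_rate": pass_rate,
            "value_accuracy": value_acc,
            "gap": pass_rate - value_acc,  # large gap: valid but wrong output
        }
    return summary


def rank_by_value_accuracy(summary: Dict[Key, Dict[str, float]]) -> List[Key]:
    """Order (model, modality) pairs from most to least accurate."""
    return sorted(summary, key=lambda k: summary[k]["value_accuracy"], reverse=True)


# Placeholder records for two hypothetical models.
records = [
    {"model": "model_a", "modality": "text", "json_pass": 1.0, "value_accuracy": 0.5},
    {"model": "model_a", "modality": "text", "json_pass": 1.0, "value_accuracy": 0.75},
    {"model": "model_b", "modality": "text", "json_pass": 0.0, "value_accuracy": 0.0},
    {"model": "model_b", "modality": "text", "json_pass": 1.0, "value_accuracy": 1.0},
]
summary = summarize(records)
ranking = rank_by_value_accuracy(summary)
```

Keeping the modality in the grouping key means a model that is strong on text but weak on image or audio inputs shows up as separate rows instead of being averaged away.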
Topics
- Structured Output Benchmark
- Value Accuracy
- JSON Schema Validation
- Large Language Models
- Multimodal AI
Best for: Research Scientist, AI Architect, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.