SAKE: Software Architectural Knowledge Evaluation Benchmark for Large Language Models

Source: cs.SE updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering) · Depth: Expert, extended

Summary

SAKE (Software Architectural Knowledge Evaluation) is a new, standardized benchmark for assessing how well Large Language Models (LLMs) reason about software architecture. It comprises 2154 expert-curated multiple-choice questions, each with four options, stratified across eight architectural categories and four context-length levels. The study uses it to evaluate 11 proprietary and open-weight models in both zero-shot and five-shot settings. Overall accuracy is consistently high, ranging from 89.31% to 94.23%, but performance varies significantly across categories, with Architectural Solutions and Quantum Computing being the most challenging. The study also finds that longer prompt context is not uniformly beneficial: it helps recall-oriented tasks but degrades accuracy on reasoning-heavy categories. SAKE, its evaluation scripts, and its results are released as open source, providing a baseline for tracking LLM architectural reasoning.

Key takeaway

AI architects and machine learning engineers deploying LLMs for software design should consult category-level benchmark results such as SAKE's to understand a specific model's strengths and weaknesses. Do not assume uniform reliability: models excel at recall but struggle with complex trade-off reasoning, especially with longer prompts. Prioritize human oversight for critical architectural decisions, and consider cost-effective models like Qwen 3 235B for tasks where top-tier accuracy offers only marginal gains.

Key insights

LLM architectural knowledge is high overall but uneven, with context length effects varying by task type.
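To see why a single overall score can hide this unevenness, here is a minimal sketch of stratified accuracy over multiple-choice results. It is not SAKE's evaluation script: the `Result` record, the `accuracy_by` helper, the integer encoding of context levels, and the toy records are all assumptions made for illustration. Only the two category names come from the summary above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One graded question (hypothetical record layout, not SAKE's schema)."""
    category: str       # one of the eight architectural categories
    context_level: int  # assumed encoding: 1 (shortest) to 4 (longest)
    predicted: str      # option letter chosen by the model, "A" to "D"
    answer: str         # gold option letter


def accuracy_by(results, key):
    """Group results with key(result) and return {group: accuracy}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        group = key(r)
        total[group] += 1
        correct[group] += r.predicted == r.answer  # True counts as 1
    return {group: correct[group] / total[group] for group in total}


# Toy records invented for illustration only; these are not SAKE data.
toy = [
    Result("Quantum Computing", 1, "A", "A"),
    Result("Quantum Computing", 4, "B", "B"),
    Result("Architectural Solutions", 1, "C", "C"),
    Result("Architectural Solutions", 4, "D", "A"),
]

overall = accuracy_by(toy, lambda r: "all")
by_category = accuracy_by(toy, lambda r: r.category)
by_cell = accuracy_by(toy, lambda r: (r.category, r.context_level))
```

Grouping by `(category, context_level)` is what would expose an interaction like the one described here, where longer context helps some categories and hurts others while the overall number barely moves.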

Principles

Method

SAKE's methodology has three steps: defining eight architectural knowledge categories from canonical references, expert-curating 2154 multiple-choice questions with dual peer review, and evaluating 11 LLMs in zero-shot and five-shot settings.
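The difference between the zero-shot and five-shot settings can be sketched as a prompt builder for four-option questions. The summary does not give SAKE's actual prompt template, so the layout below, the `Question` type, and the `build_prompt` helper are hypothetical; the only facts taken from the source are four options per question and zero or five solved exemplars.

```python
from dataclasses import dataclass

LETTERS = "ABCD"


@dataclass(frozen=True)
class Question:
    """A four-option multiple-choice item (hypothetical layout)."""
    stem: str
    options: tuple  # exactly four option strings
    answer: str     # gold option letter, "A" to "D"


def format_question(q, with_answer):
    """Render the stem, lettered options, and an answer line."""
    lines = [q.stem]
    lines += [f"{letter}. {text}" for letter, text in zip(LETTERS, q.options)]
    # Exemplars show the gold letter; the target leaves it for the model.
    lines.append(f"Answer: {q.answer}" if with_answer else "Answer:")
    return "\n".join(lines)


def build_prompt(target, exemplars=()):
    """Zero-shot when exemplars is empty, k-shot when it holds k solved items."""
    blocks = [format_question(e, with_answer=True) for e in exemplars]
    blocks.append(format_question(target, with_answer=False))
    return "\n\n".join(blocks)


# Placeholder items invented for illustration; these are not SAKE questions.
shots = [
    Question(f"Example stem {i}?", ("w", "x", "y", "z"), "A") for i in range(5)
]
target = Question("Target stem?", ("w", "x", "y", "z"), "B")

zero_shot_prompt = build_prompt(target)
five_shot_prompt = build_prompt(target, shots)
```

In this sketch the five-shot prompt is simply the zero-shot prompt preceded by five solved items, so the only variable between the two settings is the presence of exemplars.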

In practice

Topics

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.