SAKE: Software Architectural Knowledge Evaluation Benchmark for Large Language Models
Summary
SAKE (Software Architectural Knowledge Evaluation) is a standardized benchmark for assessing how well Large Language Models (LLMs) reason about software architecture. It comprises 2,154 expert-curated multiple-choice questions, each with four options, stratified across eight architectural categories and four context-length levels. The study evaluates 11 proprietary and open-weight models in both zero-shot and five-shot settings. Overall accuracy is consistently high, ranging from 89.31% to 94.23%, but performance varies significantly across categories, with Architectural Solutions and Quantum Computing the most challenging. Longer prompt context is not uniformly beneficial: it helps recall-oriented tasks but degrades accuracy on reasoning-heavy categories. SAKE, its evaluation scripts, and its results are open-source, providing a baseline for tracking LLM architectural reasoning.
Key takeaway
AI Architects and Machine Learning Engineers deploying LLMs for software design should consult category-level benchmark results such as SAKE's to understand specific model strengths and weaknesses. Do not assume uniform reliability: models excel at recall but struggle with complex trade-off reasoning, especially with longer prompts. Keep human oversight on critical architectural decisions, and consider cost-effective models such as Qwen 3 235B for tasks where top-tier accuracy offers only marginal gains.
Key insights
LLM architectural knowledge is high overall but uneven, with context length effects varying by task type.
Principles
- LLM architectural competence is category-dependent.
- Aggregate accuracy scores hide specific competency gaps.
- Context length benefits recall but hinders complex reasoning.
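To make the second principle concrete, here is a minimal sketch of how an aggregate score can mask a per-group gap. The record layout (`category`, `context_level`, `predicted`, `answer`), the group names, and the toy data are all assumptions for illustration; they are not SAKE's actual schema or results.

```python
from collections import defaultdict


def overall_accuracy(records):
    """Fraction of records whose predicted option matches the answer key."""
    return sum(r["predicted"] == r["answer"] for r in records) / len(records)


def accuracy_by_group(records, key):
    """Return {group: accuracy}, grouping records by records[i][key]."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r[key]] += 1
        correct[r[key]] += int(r["predicted"] == r["answer"])
    return {group: correct[group] / total[group] for group in total}


# Hypothetical toy records, NOT taken from the benchmark.
records = [
    {"category": "category_x", "context_level": 1, "predicted": "A", "answer": "A"},
    {"category": "category_x", "context_level": 1, "predicted": "B", "answer": "B"},
    {"category": "category_x", "context_level": 4, "predicted": "C", "answer": "C"},
    {"category": "category_y", "context_level": 4, "predicted": "A", "answer": "D"},
]

print(overall_accuracy(records))               # 0.75
print(accuracy_by_group(records, "category"))  # {'category_x': 1.0, 'category_y': 0.0}
```

The same helper can group by `context_level` instead of `category`, which is one way to check whether longer prompts help or hurt a given kind of question.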
Method
SAKE's methodology has three steps: defining eight architectural knowledge categories from canonical references, expert-curating 2,154 multiple-choice questions with dual peer review, and evaluating 11 LLMs in zero-shot and five-shot settings.
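The difference between the zero-shot and five-shot settings can be sketched as follows. This is an illustrative assumption about how such prompts are commonly assembled for four-option multiple-choice questions, not SAKE's actual prompt template; the function names, the `question`/`options`/`answer` fields, and the naive answer parser are all hypothetical.

```python
import re

LETTERS = "ABCD"  # each question has four options


def format_question(question, options, answer=None):
    """Render one question; include the answer letter only for exemplars."""
    lines = [question]
    lines += [f"{letter}. {text}" for letter, text in zip(LETTERS, options)]
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)


def build_prompt(target, exemplars=()):
    """Zero-shot when exemplars is empty; five-shot when given five solved items."""
    blocks = [
        format_question(e["question"], e["options"], e["answer"]) for e in exemplars
    ]
    blocks.append(format_question(target["question"], target["options"]))
    return "\n\n".join(blocks)


def extract_choice(reply):
    """Naive parser: first standalone capital A-D in the reply, else None."""
    match = re.search(r"\b([ABCD])\b", reply)
    return match.group(1) if match else None


# Hypothetical items for illustration only.
target = {"question": "Which option fits?", "options": ["w", "x", "y", "z"]}
solved = [dict(target, answer="B") for _ in range(5)]

zero_shot = build_prompt(target)
five_shot = build_prompt(target, solved)
```

A real harness would need a stricter answer parser, since a reply such as "A good choice is C" would be misread as `A` by this one.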
In practice
- Use SAKE to identify LLM architectural competency gaps.
- Prioritize cost-effective models for knowledge recall tasks.
- Apply human scrutiny to LLM architectural trade-off advice.
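One way to act on the cost-effectiveness advice above is to pick the cheapest model whose accuracy on the relevant category is within a small tolerance of the best. The sketch below is an assumption about how such a rule might look; the model names, accuracies, and costs are placeholders, not figures from the benchmark.

```python
def pick_cost_effective(models, tolerance=0.01):
    """Return the cheapest model within `tolerance` of the best accuracy."""
    best = max(m["accuracy"] for m in models)
    eligible = [m for m in models if best - m["accuracy"] <= tolerance]
    return min(eligible, key=lambda m: m["cost"])


# Placeholder values for illustration; NOT benchmark results or real prices.
candidates = [
    {"name": "model-a", "accuracy": 0.940, "cost": 10.0},
    {"name": "model-b", "accuracy": 0.935, "cost": 2.0},
    {"name": "model-c", "accuracy": 0.900, "cost": 0.5},
]

choice = pick_cost_effective(candidates)
print(choice["name"])  # model-b
```

Setting `tolerance=0` reduces the rule to "cheapest among the top scorers", while a larger tolerance trades accuracy for cost; for trade-off-heavy categories, human review still applies regardless of which model is chosen.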
Topics
- Software Architecture
- Large Language Models
- LLM Benchmarking
- Architectural Knowledge
- Design Patterns
- Quality Attributes
- Quantum Computing
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.