CodegenBench: Can LLMs Write Efficient Code Across Architectures?

Source: cs.SE updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering, High-Performance Computing) · Depth: Expert, extended

Summary

CodegenBench is a new benchmark suite that evaluates the ability of large language models (LLMs) to generate efficient parallel code across three hardware architectures: x86_64, Sunway, and Kunpeng. It comprises 106 standard Basic Linear Algebra Subprograms (BLAS) routines and 20 specialized computational kernels (LeetSunway, LeetKunpeng) adapted for supercomputing platforms. An extensive evaluation of nine LLMs (DeepSeek V3.2, DeepSeek V4 Flash, DeepSeek V4 Pro, Qwen3.5 Plus, Qwen 3.6 Flash, Qwen 3.6 Plus, Claude Sonnet 4.6, Claude Opus 4.6, and Claude Opus 4.7) found that the models generate optimized code for the ubiquitous x86_64 but suffer significant performance degradation on domain-specific architectures such as Sunway and Kunpeng. This limitation is attributed to the scarcity of public documentation and training data for these specialized platforms, particularly for moderately difficult problems requiring concise code snippets. The dataset and automated evaluation infrastructure are open-sourced.

Key takeaway

AI engineers building high-performance computing solutions should recognize that current LLMs excel at generating optimized code for the widely documented x86_64 architecture but struggle significantly with specialized platforms such as Sunway or Kunpeng. Plan for substantial manual intervention and expert knowledge when targeting less common HPC environments, because LLMs frequently hallucinate non-existent intrinsics or produce suboptimal code. Prioritize LLM use for foundational BLAS routines on mainstream CPUs and for tasks requiring concise code snippets, and reserve complex, architecture-specific optimizations for human experts.

Key insights

LLMs generate efficient code for common architectures but struggle with specialized HPC platforms due to data scarcity.

Method

CodegenBench uses a four-stage automated pipeline: pre-processing, prompt and code generation, compilation (build), and execution with validation against expected outcomes and performance metrics.
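The four stages can be sketched as a toy harness. This is a minimal illustration under stated assumptions, not the CodegenBench implementation: the names `Task`, `Result`, `stub_model`, and `evaluate` are hypothetical, the model call is replaced by a hard-coded stub, and Python's built-in `compile()` stands in for a real compiler toolchain targeting x86_64, Sunway, or Kunpeng.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Task:
    """Hypothetical task record; the real benchmark schema may differ."""
    name: str
    arch: str          # target architecture label, e.g. "x86_64"
    spec: str          # natural-language description of the routine
    entry_point: str   # function the generated code must define
    args: tuple        # inputs used for validation
    expected: object   # expected outcome for those inputs


@dataclass
class Result:
    task: str
    built: bool
    correct: bool
    seconds: Optional[float]  # wall-clock time of the validated call


def preprocess(task: Task) -> str:
    """Stage 1: turn the task description into a prompt."""
    return f"Target: {task.arch}\nWrite `{task.entry_point}`: {task.spec.strip()}"


def stub_model(prompt: str) -> str:
    """Stage 2 stand-in: a real run would send the prompt to an LLM."""
    return (
        "def axpy(alpha, x, y):\n"
        "    return [alpha * xi + yi for xi, yi in zip(x, y)]\n"
    )


def build(source: str):
    """Stage 3: 'compile' the candidate; returns None if the build fails."""
    try:
        return compile(source, "<generated>", "exec")
    except SyntaxError:
        return None


def run_and_validate(task: Task, code) -> Result:
    """Stage 4: execute, compare with the expected outcome, record timing."""
    namespace: dict = {}
    try:
        # Assumption: a real harness would sandbox untrusted generated code.
        exec(code, namespace)
        fn = namespace[task.entry_point]
        start = time.perf_counter()
        output = fn(*task.args)
        elapsed = time.perf_counter() - start
    except Exception:
        return Result(task.name, True, False, None)
    return Result(task.name, True, output == task.expected, elapsed)


def evaluate(task: Task, model: Callable[[str], str] = stub_model) -> Result:
    """Chain the four stages for a single task."""
    source = model(preprocess(task))
    code = build(source)
    if code is None:
        return Result(task.name, False, False, None)
    return run_and_validate(task, code)


task = Task(
    name="axpy",
    arch="x86_64",
    spec="compute alpha * x + y elementwise",
    entry_point="axpy",
    args=(2.0, [1.0, 2.0], [10.0, 20.0]),
    expected=[12.0, 24.0],
)
result = evaluate(task)
print(result.built, result.correct)
```

Keeping build failures, runtime errors, and wrong answers as separate outcomes matters for the failure modes discussed above: a hallucinated intrinsic would typically surface at the build stage, while suboptimal but valid code only shows up in the timing.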

Topics

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.