CodegenBench: Can LLMs Write Efficient Code Across Architectures?
Summary
CodegenBench is a new benchmark suite that evaluates the ability of large language models (LLMs) to generate efficient parallel code across three hardware architectures: x86_64, Sunway, and Kunpeng. It comprises 106 standard Basic Linear Algebra Subprograms (BLAS) routines and 20 specialized computational kernels (LeetSunway, LeetKunpeng) adapted for supercomputing platforms. An evaluation of nine LLMs (DeepSeek V3.2, DeepSeek V4 Flash, DeepSeek V4 Pro, Qwen3.5 Plus, Qwen 3.6 Flash, Qwen 3.6 Plus, Claude Sonnet 4.6, Claude Opus 4.6, and Claude Opus 4.7) found that models generate optimized code for the ubiquitous x86_64 but suffer significant performance degradation on domain-specific architectures such as Sunway and Kunpeng. This limitation is attributed to the scarcity of public documentation and training data for these specialized platforms, particularly for moderately difficult problems requiring concise code snippets. The dataset and automated evaluation infrastructure are open source.
Key takeaway
If you build high-performance computing solutions, expect current LLMs to generate optimized code for the widely documented x86_64 architecture but to struggle significantly with specialized platforms such as Sunway or Kunpeng. Plan for substantial manual intervention and expert knowledge when targeting less common HPC environments, because LLMs frequently hallucinate non-existent intrinsics or produce suboptimal code. Prioritize LLM use for foundational BLAS routines on mainstream CPUs and for tasks requiring concise code snippets, and reserve complex, architecture-specific optimizations for human experts.
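One way to catch hallucinated intrinsics before a human reviews the code is to compare every intrinsic-looking identifier against a list of known names. The sketch below is an illustration only, not part of CodegenBench. It assumes x86 AVX-style naming, uses a tiny hand-written allowlist of real intrinsics, and includes one deliberately invented name (`_mm256_hsum_all_pd`) to show the check firing. The function name `unknown_intrinsics` and the sample snippet are hypothetical. A comparable check for Sunway or Kunpeng would need the intrinsic list from that platform's own vendor documentation.

```python
import re

# Small illustrative allowlist of real x86 AVX/FMA intrinsics. A practical
# check would load the complete list from the vendor's intrinsics reference.
KNOWN_INTRINSICS = {
    "_mm256_set1_pd",
    "_mm256_loadu_pd",
    "_mm256_storeu_pd",
    "_mm256_fmadd_pd",
}

# Matches identifiers shaped like x86 SIMD intrinsics, e.g. _mm256_add_pd.
INTRINSIC_PATTERN = re.compile(r"\b_mm(?:256|512)?_[a-z0-9_]+\b")


def unknown_intrinsics(source, known=KNOWN_INTRINSICS):
    """Return intrinsic-like identifiers in `source` that are not in `known`."""
    found = set(INTRINSIC_PATTERN.findall(source))
    return sorted(found - set(known))


# Hypothetical model output: an AXPY inner loop body in C. The last line uses
# a made-up intrinsic name, invented here purely to demonstrate the check.
GENERATED = """
__m256d va = _mm256_set1_pd(a);
__m256d vx = _mm256_loadu_pd(x + i);
__m256d vy = _mm256_loadu_pd(y + i);
_mm256_storeu_pd(y + i, _mm256_fmadd_pd(va, vx, vy));
double s = _mm256_hsum_all_pd(vy);
"""

print(unknown_intrinsics(GENERATED))  # ['_mm256_hsum_all_pd']
```

A name-based check like this only flags unknown identifiers. It says nothing about whether known intrinsics are used correctly or efficiently, so it complements compilation and expert review without replacing either.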
Key insights
LLMs generate efficient code for common architectures but struggle with specialized HPC platforms due to data scarcity.
Principles
- LLM code quality degrades with architecture specificity.
- Performance gains scale non-linearly with model size.
- Code correctness correlates inversely with code length.
Method
CodegenBench uses a four-stage automated pipeline: (1) pre-processing, (2) prompt and code generation, (3) compilation (build), and (4) execution, with validation against expected outcomes and performance metrics.
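The four stages can be sketched as a small harness. This is a minimal illustration under stated assumptions, not the benchmark's actual infrastructure: the `Task` and `Result` schemas, the function names, and the `fake_model` stand-in are all hypothetical. To stay self-contained, the "build" stage uses Python's built-in `compile()` in place of a native toolchain, and the "model" is a function that returns fixed source text.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class Task:
    """One benchmark problem (hypothetical schema, not the benchmark's own)."""
    name: str
    prompt: str
    entry_point: str           # function the generated code must define
    args: tuple                # inputs passed to the entry point
    expected: Sequence[float]  # reference output used for validation


@dataclass
class Result:
    name: str
    built: bool
    correct: bool
    seconds: Optional[float]


def preprocess(task: Task, arch: str) -> str:
    """Stage 1: turn the raw task into an architecture-specific prompt."""
    return f"Target architecture: {arch}\n{task.prompt}"


def build(source: str):
    """Stage 3: stand-in build step; returns None when the build fails."""
    try:
        return compile(source, "<generated>", "exec")
    except SyntaxError:
        return None


def run_and_validate(code, task: Task, tol: float = 1e-9) -> Tuple[bool, float]:
    """Stage 4: execute, time the call, and compare with the reference."""
    namespace: dict = {}
    exec(code, namespace)
    fn = namespace[task.entry_point]
    start = time.perf_counter()
    got = fn(*task.args)
    elapsed = time.perf_counter() - start
    ok = len(got) == len(task.expected) and all(
        abs(g - e) <= tol for g, e in zip(got, task.expected)
    )
    return ok, elapsed


def evaluate(task: Task, arch: str, generate: Callable[[str], str]) -> Result:
    prompt = preprocess(task, arch)
    source = generate(prompt)  # Stage 2: prompt the model for code
    code = build(source)
    if code is None:
        return Result(task.name, False, False, None)
    try:
        ok, elapsed = run_and_validate(code, task)
    except Exception:
        # Runtime failure (crash, missing entry point) counts as incorrect.
        return Result(task.name, True, False, None)
    return Result(task.name, True, ok, elapsed)


def fake_model(prompt: str) -> str:
    """Stand-in for an LLM call: always returns a simple AXPY routine."""
    return "def axpy(a, x, y):\n    return [a * xi + yi for xi, yi in zip(x, y)]\n"


task = Task(
    name="axpy",
    prompt="Write axpy(a, x, y) returning a*x + y element-wise.",
    entry_point="axpy",
    args=(2.0, [1.0, 2.0], [10.0, 20.0]),
    expected=[12.0, 24.0],
)
result = evaluate(task, "x86_64", fake_model)
print(result.built, result.correct)  # True True
```

Keeping build failures, runtime failures, and wrong answers as separate outcomes makes it possible to tell whether a model failed to produce compilable code or produced code that compiled but was incorrect or slow.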
In practice
- Prioritize LLMs for x86_64 code generation.
- Expect manual optimization for Sunway/Kunpeng.
- Focus LLMs on concise, moderately complex tasks.
Topics
- LLM Code Generation
- High-Performance Computing
- x86_64 Architecture
- Sunway Supercomputer
- Kunpeng Processors
- BLAS Routines
- Cross-Architecture Portability
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.