SWE-InfraBench: Evaluating Language Models on Cloud Infrastructure Code

2025-04-14 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Software Development & Engineering · Depth: Expert, extended

Summary

SWE-InfraBench is a benchmark dataset for evaluating how well large language models modify imperative Infrastructure-as-Code (IaC) written with the AWS Cloud Development Kit (CDK). Existing benchmarks focus on declarative IaC such as Terraform, or on generating whole codebases. SWE-InfraBench instead comprises 100 tasks drawn from 34 real-world IaC codebases, each asking the model to make an incremental code edit from a natural language instruction. An edit counts as successful only if it passes the unit tests provided with the task, which requires reasoning about dependencies between cloud resources. Initial single-turn evaluations show clear limitations: the best-performing model, Claude 3.7 Sonnet, reaches a 34% success rate, and DeepSeek R1 reaches 24%. Multi-turn agentic approaches that feed back errors and add Retrieval-Augmented Generation (RAG) raise performance substantially, with Claude 3.5 Sonnet V2 reaching 65% correctness. The dataset is publicly available on Kaggle.

Key takeaway

AI engineers building tools for cloud infrastructure management should recognize that current LLMs struggle with imperative Infrastructure-as-Code modification in AWS CDK: even the best model succeeds on only 34% of tasks in a single attempt. To improve correctness, use multi-turn agentic approaches that return detailed error messages and test results to the model; in the reported evaluations this raised success rates to as much as 65%. Consider adding Retrieval-Augmented Generation (RAG) over relevant documentation to further improve performance, especially on tasks that require complex dependency reasoning.

Key insights

LLMs struggle with imperative IaC code modification, but iterative feedback and RAG significantly improve performance.

Principles

Imperative IaC editing requires deep reasoning about cloud dependencies.
Iterative feedback loops enhance LLM code generation correctness.
Detailed diagnostic information is crucial for effective error correction.

Method

SWE-InfraBench uses a multi-stage pipeline combining human expertise with LLM assistance for task generation, critique, and refinement, ensuring high-quality, verifiable IaC modification challenges.

In practice

Implement multi-turn LLM agents for IaC development tasks.
Provide high-verbosity error feedback to models during refinement.
Integrate RAG with relevant documentation for enhanced context.
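The multi-turn pattern in the list above can be sketched as a simple refinement loop. This is a minimal illustration under stated assumptions, not the paper's agent: `generate` stands in for any LLM call, `run_tests` for any test harness, and the names `Feedback` and `refine_until_passing` and the prompt wording are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Feedback:
    passed: bool
    log: str  # full, unabridged test output to return to the model


def refine_until_passing(
    instruction: str,
    generate: Callable[[str], str],        # hypothetical LLM call: prompt -> code
    run_tests: Callable[[str], Feedback],  # hypothetical harness: code -> feedback
    max_turns: int = 5,
) -> Optional[str]:
    """Ask for an edit, test it, and retry with the test log until it passes."""
    prompt = instruction
    for _ in range(max_turns):
        candidate = generate(prompt)
        report = run_tests(candidate)
        if report.passed:
            return candidate
        # High-verbosity feedback: include the failed attempt and the whole log.
        prompt = (
            f"{instruction}\n\nPrevious attempt:\n{candidate}\n\n"
            f"Test output:\n{report.log}\n\nFix the code so the tests pass."
        )
    return None  # no passing edit within the turn budget
```

Passing the model and the harness in as callables keeps the loop independent of any particular provider. Retrieved documentation could be prepended to `instruction` before the first call, though how best to select it is left open here.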

Topics

AWS CDK
IaC Evaluation
LLM Code Editing
Multi-Turn Agents
Retrieval-Augmented Generation
Cloud Infrastructure

Best for: Research Scientist, AI Scientist, AI Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.