Outrunning LLM Cutoffs: A Live Kernel Crash Resolution Benchmark for All

2026-06-18 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

Researchers introduce Live-kBench, a self-evolving benchmark, and kEnv, an agent-agnostic environment, to address limitations in evaluating Large Language Model (LLM) agents for Linux kernel crash resolution. Existing benchmarks are static, risking data contamination and failing to reflect the kernel's dynamic nature. Live-kBench continuously curates and evaluates agents on freshly discovered bugs, while kEnv standardizes the execution environment for fair, scalable comparisons. An inaugural dataset, Live-kBench-2512, comprises 534 Linux kernel bugs from April 2024 to December 2025. Empirical results show LLM agents achieve up to 25% higher equivalent patch rates on bugs fixed before their knowledge cutoffs. State-of-the-art agents, including mini-SWE-agent, SWE-agent, and OpenHands, resolve 74% of crashes on the first attempt, but only approximately 20% of generated patches closely match developer fixes. Additionally, providing crash resolution feedback improves the crash resolution rate by 29%.

Key takeaway

AI engineers building LLM-based agents for Linux kernel bug resolution should evaluate against fresh, post-cutoff data to avoid inflated performance metrics. Iterative crash resolution feedback is worth implementing, since it improves the crash resolution rate by 29%. A resolved crash is not the same as a correct fix: only about 20% of generated patches closely match the developer fix, which points to a need for deeper semantic understanding or more robust validation.

Key insights

LLM-based kernel bug fixing requires dynamic, contamination-free evaluation and standardized execution environments to accurately assess agent performance.

Principles

Static benchmarks risk data contamination.
Decouple agent logic from heavy execution.
Feedback improves crash resolution rate.

Method

Live-kBench continuously curates fresh kernel bugs from Syzbot, filters them, then invokes agents via kEnv. kEnv standardizes the Linux kernel environment, providing a "run_kernel" interface for crash resolution feedback and patch evaluation.
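The feedback loop described above can be sketched as follows. This is a minimal illustration, not kEnv's actual API: only the name "run_kernel" comes from the summary, while its signature, the `RunResult` type, the `propose_patch` callable, and the `max_attempts` parameter are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunResult:
    """Hypothetical result of booting a patched kernel on the reproducer."""
    crashed: bool
    log: str


def resolve_with_feedback(
    propose_patch: Callable[[str, Optional[str]], str],
    run_kernel: Callable[[str], RunResult],
    crash_report: str,
    max_attempts: int = 3,
) -> tuple[Optional[str], int]:
    """Ask the agent for a patch, test it, and feed the crash log back on failure.

    Returns (patch, attempts_used); patch is None if every attempt still crashed.
    """
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        # The agent sees the original report plus the log of the last failed run.
        patch = propose_patch(crash_report, feedback)
        result = run_kernel(patch)  # assumed: applies the patch, reruns the reproducer
        if not result.crashed:
            return patch, attempt
        feedback = result.log
    return None, max_attempts
```

Keeping `run_kernel` as an injected callable mirrors the idea of decoupling agent logic from heavy execution: the same loop can be driven by a real environment or by a stub in tests.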

In practice

Prioritize fresh data for LLM agent evaluation.
Implement crash resolution feedback loops.
Expect low patch equivalence to human fixes.
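One way to act on the first point is to report results separately for bugs fixed before and after the model's knowledge cutoff. The sketch below is an assumption-laden illustration, not the benchmark's own tooling: the `BugOutcome` record, its field names, and the `rate_by_cutoff` helper are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class BugOutcome:
    """Hypothetical per-bug record: when the developer fix landed and whether the agent succeeded."""
    bug_id: str
    fix_date: date
    success: bool


def rate_by_cutoff(outcomes: list[BugOutcome], cutoff: date) -> dict[str, Optional[float]]:
    """Split outcomes at the model's knowledge cutoff and compute a success rate per side."""
    pre = [o for o in outcomes if o.fix_date < cutoff]    # possibly seen during training
    post = [o for o in outcomes if o.fix_date >= cutoff]  # fresh, contamination-free

    def rate(group: list[BugOutcome]) -> Optional[float]:
        # None signals an empty group rather than a misleading 0.0.
        return sum(o.success for o in group) / len(group) if group else None

    return {"pre_cutoff": rate(pre), "post_cutoff": rate(post)}
```

A large gap between the two rates suggests that the pre-cutoff figure is inflated by contamination, so the post-cutoff rate is the safer number to report.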

Topics

Linux Kernel
LLM Agents
Automated Program Repair
Live Benchmarking
Data Contamination
Crash Resolution Feedback
Syzbot

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.