JETO-Bench: A Reproducible Benchmark for Execution Time Improvement Patches in Java

2025-11-28 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

JETO-Mine is introduced as the first configurable and reusable tool for building reproducible benchmarks of execution time improvement patches (ETIPs) in real-world Java projects. It runs a three-phase pipeline: static analysis identifies ETIPs in GitHub repositories using user-defined filters and an LLM-based classifier; dynamic analysis wraps each ETIP in a Docker image for reproducible execution and statistical testing; and an evaluation harness provides quantitative assessment. The authors used JETO-Mine to build JETO-Bench, which comprises 660 identified ETIPs and 91 manually verified executable ETIPs from 174 open-source Java repositories, mined from 11 years of history and nearly 1.8 million commits. When evaluated on JETO-Bench, OpenHands correctly fixed 14.3% (13/91) of the issues, in line with results reported for other languages. The study also highlights a significant lack of tests that demonstrate execution time improvements in open-source Java projects.

Key takeaway

For AI Engineers or Research Scientists developing automated program repair tools for Java, JETO-Bench provides a reproducible, containerized environment for evaluating execution time improvement patches. Use JETO-Mine to collect new benchmark instances continuously, and use its evaluation harness to get execution-based feedback on generated patches. Doing so helps address the current limitations of coding agents and the scarcity of performance-specific tests in Java projects.

Key insights

JETO-Mine creates reproducible Java performance benchmarks, revealing agent limitations and testing gaps.

Principles

Java performance benchmarking demands statistical rigor due to JVM characteristics.
Execution-based feedback is crucial for accurate patch assessment.
Open-source Java projects largely lack dedicated performance tests.
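
To make the first principle concrete: run-to-run timings on the JVM are noisy (JIT warm-up and garbage collection are common causes), so a single before/after measurement is weak evidence. The sketch below compares two samples of run times with a one-sided Mann-Whitney U test under a normal approximation. The summary does not say which statistical test JETO-Mine applies, so the choice of test, the class name `RuntimeComparison`, and the threshold are all assumptions made for illustration.

```java
final class RuntimeComparison {

    /**
     * Mann-Whitney U statistic: counts pairs (x in a, y in b) with x < y.
     * Tied pairs contribute 0.5 each.
     */
    static double uStatistic(double[] a, double[] b) {
        double u = 0.0;
        for (double x : a) {
            for (double y : b) {
                if (x < y) {
                    u += 1.0;
                } else if (x == y) {
                    u += 0.5;
                }
            }
        }
        return u;
    }

    /**
     * z score of U under the normal approximation (no tie correction).
     * The approximation is only reasonable for moderately large samples.
     */
    static double zScore(double[] a, double[] b) {
        double n1 = a.length;
        double n2 = b.length;
        double mean = n1 * n2 / 2.0;
        double sd = Math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0);
        return (uStatistic(a, b) - mean) / sd;
    }

    /**
     * One-sided check: are patched run times stochastically smaller than
     * baseline run times? zCritical = 1.645 corresponds to alpha = 0.05.
     */
    static boolean significantlyFaster(double[] patched, double[] baseline, double zCritical) {
        return zScore(patched, baseline) > zCritical;
    }

    public static void main(String[] args) {
        // Hypothetical run times in milliseconds, not data from the paper.
        double[] baseline = {120, 118, 125, 130, 122, 119, 127, 124};
        double[] patched = {101, 99, 104, 110, 98, 103, 100, 105};
        System.out.println("z = " + zScore(patched, baseline));
        System.out.println("faster = " + significantlyFaster(patched, baseline, 1.645));
    }
}
```

A rank-based test is used here because run-time distributions are often skewed, so it avoids assuming normality of the raw timings.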

Method

JETO-Mine's pipeline includes static analysis (GitHub crawl, LLM filter), dynamic analysis (Docker containerization, statistical testing), and an evaluation harness for patches and tests.
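
The static-analysis phase can be pictured as cheap user-defined filters running first, with the more expensive classifier consulted only for commits that survive them. The following is a minimal sketch of that idea, not JETO-Mine's actual code: the `Commit` type, the filter helpers, and the classifier (modeled as a plain `Predicate` standing in for an LLM call) are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Predicate;

final class EtipCandidateFilter {

    /** Hypothetical, minimal view of a mined commit. */
    static final class Commit {
        final String sha;
        final String message;
        final int javaFilesChanged;

        Commit(String sha, String message, int javaFilesChanged) {
            this.sha = sha;
            this.message = message;
            this.javaFilesChanged = javaFilesChanged;
        }
    }

    /** User-defined filter: commit message contains any keyword (case-insensitive). */
    static Predicate<Commit> messageMentionsAny(String... keywords) {
        return c -> {
            String msg = c.message.toLowerCase(Locale.ROOT);
            for (String k : keywords) {
                if (msg.contains(k.toLowerCase(Locale.ROOT))) {
                    return true;
                }
            }
            return false;
        };
    }

    /** User-defined filter: commit changes at least one Java file. */
    static Predicate<Commit> touchesJavaFiles() {
        return c -> c.javaFilesChanged > 0;
    }

    /**
     * Keeps commits that pass the cheap filter and are then accepted by the
     * classifier. The && short-circuits, so the classifier is only called
     * for commits that already passed the user filter.
     */
    static List<Commit> select(List<Commit> commits,
                               Predicate<Commit> userFilter,
                               Predicate<Commit> classifier) {
        List<Commit> kept = new ArrayList<>();
        for (Commit c : commits) {
            if (userFilter.test(c) && classifier.test(c)) {
                kept.add(c);
            }
        }
        return kept;
    }
}
```

Filters compose with `Predicate.and`, for example `messageMentionsAny("faster", "speed up").and(touchesJavaFiles())`. The commits that `select` returns would then move on to the dynamic phase for containerized execution.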

In practice

Utilize JETO-Mine to continuously collect new Java ETIP benchmarks.
Evaluate coding agents on real-world Java performance issues.
Generate ETIP detector tests for Java projects using the evaluation harness.
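
As a rough illustration of what a test that detects an execution time improvement might look like, the sketch below times a workload after warm-up iterations and compares the median run time before and after a patch. It is a simplified stand-in written for this summary, not the harness shipped with JETO-Mine: `TimingHarness`, its method names, and the speedup threshold are assumptions. For serious measurements a dedicated tool such as JMH is the usual choice.

```java
import java.util.Arrays;

final class TimingHarness {

    /** Runs the workload `warmups` times untimed, then records `runs` timings in nanoseconds. */
    static long[] measureNanos(Runnable workload, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            workload.run();
        }
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            workload.run();
            samples[i] = System.nanoTime() - start;
        }
        return samples;
    }

    /** Median of the samples; the input array is left unmodified. */
    static double median(long[] samples) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /** True when the median baseline time is at least `minSpeedup` times the median patched time. */
    static boolean detectsImprovement(long[] baseline, long[] patched, double minSpeedup) {
        return median(baseline) / median(patched) >= minSpeedup;
    }

    public static void main(String[] args) {
        // Hypothetical workloads: string concatenation in a loop vs. StringBuilder.
        Runnable before = () -> {
            String s = "";
            for (int i = 0; i < 2_000; i++) {
                s += i;
            }
            if (s.isEmpty()) throw new IllegalStateException();
        };
        Runnable after = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 2_000; i++) {
                sb.append(i);
            }
            if (sb.length() == 0) throw new IllegalStateException();
        };
        long[] baseline = measureNanos(before, 20, 30);
        long[] patched = measureNanos(after, 20, 30);
        System.out.println("improvement detected: " + detectsImprovement(baseline, patched, 1.2));
    }
}
```

The median is used instead of the mean so that a single garbage-collection pause does not dominate the comparison. A threshold on the median ratio is still a heuristic, so a statistical test over the samples gives stronger evidence.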

Topics

Java Performance Optimization
Automated Program Repair
Execution Time Improvement Patches
Performance Benchmarking
Large Language Models
Docker Containerization
JETO-Bench

Best for: AI Scientist, Research Scientist, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.