JETO-Bench: A Reproducible Benchmark for Execution Time Improvement Patches in Java
Summary
The paper introduces JETO-Mine as the first configurable and reusable tool for creating reproducible benchmarks of execution time improvement patches (ETIPs) in real-world Java projects. It uses a three-phase pipeline: static analysis identifies ETIPs in GitHub repositories using user-defined filters and an LLM-based classifier; dynamic analysis wraps ETIPs in Docker images for reproducible execution and statistical testing; and an evaluation harness provides quantitative assessment. The authors used JETO-Mine to build JETO-Bench, which comprises 660 identified ETIPs and 91 manually verified executable ETIPs from 174 open-source Java repositories, collected by scanning 11 years of history and nearly 1.8 million commits. In an evaluation on JETO-Bench, OpenHands correctly fixed 14.3% (13/91) of issues, a rate in line with results reported for other languages. The study also highlights a significant lack of tests that demonstrate execution time improvements in open-source Java projects.
Key takeaway
For AI Engineers and Research Scientists developing automated program repair tools for Java, JETO-Bench provides a reproducible environment for evaluating execution time improvement patches. Use JETO-Mine to keep collecting new benchmark instances, and use its evaluation harness to get execution-based feedback on generated patches. Both help address the current limitations of coding agents and the scarcity of performance-specific tests in Java projects.
Key insights
JETO-Mine creates reproducible Java performance benchmarks, revealing agent limitations and testing gaps.
Principles
- Java performance benchmarking demands statistical rigor due to JVM characteristics.
- Execution-based feedback is crucial for accurate patch assessment.
- Open-source Java projects largely lack dedicated performance tests.
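To make the first principle concrete, here is a minimal sketch of a warmed-up, repeated timing comparison in Java. The summary does not say which statistical test JETO-Mine applies, so the one-sided Mann-Whitney U test with a normal approximation (no tie correction) is an assumption, as are the class name, method names, warmup count, and the 1.645 threshold (one-sided, alpha = 0.05).

```java
final class TimingComparison {
    // Written by the workloads in main so the JIT cannot discard their results.
    static volatile int sink;

    /** Runs the task `warmup` times (discarded), then returns `runs` timings in nanoseconds. */
    static long[] sample(Runnable task, int warmup, int runs) {
        for (int i = 0; i < warmup; i++) task.run();   // let the JIT compile hot paths first
        long[] out = new long[runs];
        for (int i = 0; i < runs; i++) {
            long t0 = System.nanoTime();
            task.run();
            out[i] = System.nanoTime() - t0;
        }
        return out;
    }

    /** Mann-Whitney U for `a`: number of pairs with a_i < b_j, ties counted as 0.5. */
    static double uStatistic(long[] a, long[] b) {
        double u = 0;
        for (long x : a) {
            for (long y : b) {
                if (x < y) u += 1.0;
                else if (x == y) u += 0.5;
            }
        }
        return u;
    }

    /** Normal-approximation z-score for U (assumption: no tie correction). */
    static double zScore(double u, int n1, int n2) {
        double mean = (double) n1 * n2 / 2.0;
        double sd = Math.sqrt((double) n1 * n2 * (n1 + n2 + 1) / 12.0);
        return (u - mean) / sd;
    }

    /** True if the patched timings are significantly smaller than the original ones. */
    static boolean significantlyFaster(long[] patched, long[] original) {
        double z = zScore(uStatistic(patched, original), patched.length, original.length);
        return z > 1.645;   // assumed one-sided 5% significance level
    }

    public static void main(String[] args) {
        // Hypothetical "before" and "after" workloads standing in for a real ETIP.
        Runnable before = () -> {
            String s = "";
            for (int i = 0; i < 2000; i++) s += i;
            sink = s.length();
        };
        Runnable after = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 2000; i++) sb.append(i);
            sink = sb.length();
        };
        long[] original = sample(before, 50, 30);
        long[] patched = sample(after, 50, 30);
        System.out.println("significantly faster: " + significantlyFaster(patched, original));
    }
}
```

A rank-based test is used here because JVM timings are often skewed by garbage collection and JIT compilation. For serious measurements, a dedicated harness such as JMH is a better choice than hand-rolled loops.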
Method
JETO-Mine's pipeline includes static analysis (GitHub crawl, LLM filter), dynamic analysis (Docker containerization, statistical testing), and an evaluation harness for patches and tests.
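The static-analysis phase can be pictured as a cheap user-defined filter followed by a more expensive classifier. The sketch below is illustrative only: the `Commit` type, the keyword filter, and the `classifier` predicate (a stand-in for the LLM call) are hypothetical names and do not reflect JETO-Mine's actual API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Predicate;

final class EtipPrefilter {
    /** Hypothetical minimal view of a commit; real crawlers carry far more metadata. */
    static final class Commit {
        final String sha;
        final String message;
        final List<String> changedFiles;

        Commit(String sha, String message, List<String> changedFiles) {
            this.sha = sha;
            this.message = message;
            this.changedFiles = changedFiles;
        }
    }

    /** User-defined filter: the commit changes at least one Java source file. */
    static Predicate<Commit> touchesJava() {
        return c -> c.changedFiles.stream().anyMatch(f -> f.endsWith(".java"));
    }

    /** User-defined filter: the commit message mentions any keyword (case-insensitive). */
    static Predicate<Commit> mentionsAny(String... keywords) {
        return c -> {
            String msg = c.message.toLowerCase(Locale.ROOT);
            for (String k : keywords) {
                if (msg.contains(k.toLowerCase(Locale.ROOT))) return true;
            }
            return false;
        };
    }

    /**
     * Keeps commits that pass the cheap filter and then the classifier.
     * The filter is checked first so the classifier (assumed expensive) runs on fewer commits.
     */
    static List<Commit> candidates(List<Commit> commits,
                                   Predicate<Commit> userFilter,
                                   Predicate<Commit> classifier) {
        List<Commit> out = new ArrayList<>();
        for (Commit c : commits) {
            if (userFilter.test(c) && classifier.test(c)) out.add(c);
        }
        return out;
    }
}
```

Ordering the checks this way matters at the scale the summary describes: with nearly 1.8 million commits scanned, running a model on every commit would be costly, so a cheap filter in front of it is a plausible design.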
In practice
- Utilize JETO-Mine to continuously collect new Java ETIP benchmarks.
- Evaluate coding agents on real-world Java performance issues.
- Generate ETIP detector tests for Java projects using the evaluation harness.
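Evaluating an agent on such a benchmark reduces to a verdict per issue plus an aggregate rate. The sketch below shows one plausible shape for that logic; the verdict names, the median-based speedup check, and the `minSpeedup` parameter are assumptions for illustration and are not taken from JETO-Mine's harness. Timing arrays are assumed to be non-empty.

```java
import java.util.Arrays;

final class HarnessVerdict {
    /** Hypothetical outcome categories for one generated patch. */
    enum Verdict { BUILD_FAILED, TESTS_FAILED, NO_SPEEDUP, FIXED }

    /** Median of a non-empty array; the input is copied, not modified. */
    static double median(long[] xs) {
        long[] s = xs.clone();
        Arrays.sort(s);
        int n = s.length;
        return n % 2 == 1 ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
    }

    /** A patch counts as FIXED only if it builds, passes tests, and reaches the speedup threshold. */
    static Verdict judge(boolean builds, boolean testsPass,
                         long[] baselineNanos, long[] patchedNanos, double minSpeedup) {
        if (!builds) return Verdict.BUILD_FAILED;
        if (!testsPass) return Verdict.TESTS_FAILED;
        double speedup = median(baselineNanos) / median(patchedNanos);
        return speedup >= minSpeedup ? Verdict.FIXED : Verdict.NO_SPEEDUP;
    }

    /** Share of fixed issues as a percentage, e.g. 13 of 91. */
    static double fixedRatePercent(int fixed, int total) {
        return 100.0 * fixed / total;
    }
}
```

For the figures in the summary, `fixedRatePercent(13, 91)` evaluates to roughly 14.29, which rounds to the reported 14.3%.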
Topics
- Java Performance Optimization
- Automated Program Repair
- Execution Time Improvement Patches
- Performance Benchmarking
- Large Language Models
- Docker Containerization
- JETO-Bench
Best for: AI Scientist, Research Scientist, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.