CoCoMUT: A Tool for Code-Context Mining and Automated Dataset Generation

· Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Software Development & Engineering, Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, long

Summary

CoCoMUT is a Java tool for automated code-context mining and dataset generation. It targets a recurring cost in building software-engineering assistants: collecting method-level context by hand. The tool either extracts comprehensive context for focal methods or generates datasets at class, package, or system scope. To do so, it discovers the project structure, resolves build and classpath information, constructs a static call graph with SootUp, and reconciles bytecode-level call edges with Spoon-based source extraction. Each method record combines source, class, documentation, call-graph, and metadata context, giving reproducible inputs for training and running learned software-engineering techniques. In an evaluation on 20 real-world Java repositories (10 Maven, 10 Gradle), CoCoMUT processed all 20 and emitted 56,512 method-context records and 386,048 serialized call edges. It reconciled 97.8% of project-source bytecode targets, and a 200-record manual audit yielded a 99.0% pass rate.
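To make the reconciliation step and its rate concrete, here is a minimal sketch in Python. The summary does not describe how CoCoMUT matches bytecode targets to source methods, so everything here is an assumption for illustration: the MethodKey identity (owner class, method name, parameter types), the exact-match rule, and the example class names are all hypothetical. The compiler-generated lambda method shows one plausible reason a bytecode target might have no source counterpart.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class MethodKey:
    """Hypothetical normalized method identity shared by both analyses."""
    owner: str                 # fully qualified declaring class
    name: str                  # method name
    params: Tuple[str, ...]    # parameter type names, in order


def reconcile(bytecode_targets: List[MethodKey],
              source_methods: List[MethodKey]):
    """Split bytecode call targets into matched and unmatched lists.

    Assumption: a target reconciles when an identical key exists in the
    source-extracted method set. The real tool may use a richer rule.
    """
    source_index = set(source_methods)
    matched = [t for t in bytecode_targets if t in source_index]
    unmatched = [t for t in bytecode_targets if t not in source_index]
    return matched, unmatched


def reconciliation_rate(matched, unmatched) -> float:
    """Fraction of bytecode targets that map to a source method."""
    total = len(matched) + len(unmatched)
    return len(matched) / total if total else 0.0


# Illustrative data only; these classes are not from the evaluated repositories.
source_methods = [
    MethodKey("org.example.Cart", "add", ("org.example.Item",)),
    MethodKey("org.example.Cart", "total", ()),
]
bytecode_targets = [
    MethodKey("org.example.Cart", "add", ("org.example.Item",)),
    MethodKey("org.example.Cart", "total", ()),
    # Compiler-generated lambda body: present in bytecode, absent from source.
    MethodKey("org.example.Cart", "lambda$total$0", ("org.example.Item",)),
]

matched, unmatched = reconcile(bytecode_targets, source_methods)
rate = reconciliation_rate(matched, unmatched)
print(f"reconciled {len(matched)} of {len(bytecode_targets)} targets ({rate:.1%})")
```

Under this reading, the reported 97.8% would be the share of project-source bytecode targets that land in the matched list.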

Key takeaway

For AI scientists and software engineers building context-aware software-engineering techniques, CoCoMUT offers a reproducible way to generate method-level context datasets. Its automated pipeline emits versioned JSONL records, which reduces manual data collection and makes the inputs to model training and evaluation more consistent. Consider it when you need to standardize context extraction across diverse Java projects.

Key insights

CoCoMUT unifies source and bytecode analysis to generate reproducible, comprehensive method-level context datasets for Java projects.

Method

CoCoMUT's pipeline comprises six stages: project discovery, build and classpath resolution, Spoon source analysis, SootUp call-graph construction, source–bytecode reconciliation, and versioned JSONL generation.
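The last stage, versioned JSONL generation, can be pictured with a short Python sketch. The field names, the schema_version value, and the example method below are assumptions chosen to mirror the five context kinds named in the summary (source, class, documentation, call graph, metadata); the actual CoCoMUT schema is not given here and may differ.

```python
import io
import json
from dataclasses import asdict, dataclass, field
from typing import List

SCHEMA_VERSION = "1.0"  # hypothetical version tag, not CoCoMUT's real one


@dataclass
class MethodContextRecord:
    """Hypothetical method-level record; one instance becomes one JSONL line."""
    method_id: str
    source: str
    class_context: str
    documentation: str
    callers: List[str] = field(default_factory=list)
    callees: List[str] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)
    schema_version: str = SCHEMA_VERSION


def write_jsonl(records, stream) -> None:
    """Write one JSON object per line; sorted keys keep the output stable."""
    for record in records:
        stream.write(json.dumps(asdict(record), sort_keys=True) + "\n")


def read_jsonl(stream) -> list:
    """Parse each non-empty line back into a dict."""
    return [json.loads(line) for line in stream if line.strip()]


# Illustrative record only; the method and project names are made up.
record = MethodContextRecord(
    method_id="org.example.Cart#total()",
    source="public int total() { return items.size(); }",
    class_context="public class Cart { private List<Item> items; }",
    documentation="Returns the number of items in the cart.",
    callers=["org.example.Checkout#summary()"],
    callees=["java.util.List#size()"],
    metadata={"build_system": "maven"},
)

buffer = io.StringIO()
write_jsonl([record], buffer)
buffer.seek(0)
rows = read_jsonl(buffer)
print(rows[0]["method_id"], rows[0]["schema_version"])
```

Carrying a version tag on every line lets a consumer reject or migrate records produced under a different schema, which is one way to keep datasets reproducible across tool releases.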

Best for: AI Scientist, Research Scientist, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.