Beyond Code Pairs: Dialogue-Based Data Generation for LLM Code Translation

Source: cs.SE updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering) · Depth: Expert, extended

Summary

An automated dataset generation pipeline built on a dual-LLM Questioner–Solver design improves LLM code translation for low-resource programming domains such as Fortran and CUDA. The pipeline uses compiler and runtime feedback as external knowledge to produce verified source–target code pairs with unit tests, along with multi-turn dialogues that capture the reasoning behind each translation. Applied to Fortran→C++ and C++→CUDA, it produced 3.64k and 3.93k dialogues, respectively. Fine-tuning on this data improved functional correctness, raising unit test success rates by over 56% on the C++-to-CUDA task. A 7B open-weight model fine-tuned on the data outperformed larger proprietary systems on metrics such as compilation success.

Key takeaway

For Machine Learning Engineers building code translation systems: dialogue-based data generation can improve LLM performance on low-resource languages such as Fortran and on specialized frameworks such as CUDA. Fine-tune on either multi-turn dialogues or Question-Solution pairs, choosing the data granularity that matches the task's syntactic or semantic complexity. Doing so can raise functional correctness and may let a smaller model outperform larger proprietary ones.
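
The two data granularities mentioned above can be illustrated with a small sketch. It shows one dialogue exported either as a full multi-turn record or as a single Question-Solution pair. The record layout, the role names, and the choice to pair the first request with the last answer are all assumptions made for illustration; the summary does not specify how the authors construct Question-Solution pairs.

```python
from typing import Dict, List

Turn = Dict[str, str]  # hypothetical layout: {"role": ..., "content": ...}


def to_multi_turn(dialogue: List[Turn]) -> Dict[str, List[Turn]]:
    """Keep every turn, mapped to the chat roles most fine-tuning tools expect."""
    role_map = {"questioner": "user", "solver": "assistant"}
    return {
        "messages": [
            {"role": role_map[t["role"]], "content": t["content"]}
            for t in dialogue
        ]
    }


def to_question_solution(dialogue: List[Turn]) -> Dict[str, str]:
    """Collapse the dialogue to the first request and the last answer.

    Assumption: the last solver turn is the verified translation.
    """
    question = next(t["content"] for t in dialogue if t["role"] == "questioner")
    solution = next(
        t["content"] for t in reversed(dialogue) if t["role"] == "solver"
    )
    return {"question": question, "solution": solution}


# Toy dialogue, invented for illustration only.
example_dialogue = [
    {"role": "questioner", "content": "Translate this Fortran function to C++."},
    {"role": "solver", "content": "int add(int a, int b) { return a + b }"},
    {"role": "questioner", "content": "The compiler reports a missing ';'. Fix it."},
    {"role": "solver", "content": "int add(int a, int b) { return a + b; }"},
]

multi_turn_record = to_multi_turn(example_dialogue)
pair_record = to_question_solution(example_dialogue)
```

The multi-turn record preserves the intermediate failure and its repair, while the pair keeps only the end result, which is the trade-off the takeaway refers to when it speaks of matching granularity to the task.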

Key insights

A dual-LLM Questioner–Solver generates dialogue-based data, improving LLM code translation in low-resource domains.

Principles

Method

A dual-LLM Questioner–Solver pipeline generates multi-turn dialogues, unit tests, and verified translations by integrating compiler and runtime feedback for iterative refinement.
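
The iterative refinement described above can be sketched as a loop in which a Questioner issues a request, a Solver proposes a translation, and tool feedback decides whether to stop or try again. This is a minimal sketch, not the authors' implementation: the function signatures, the `max_turns` limit, and the stub models below are hypothetical, and a real verifier would invoke an actual compiler and run unit tests.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Turn:
    role: str      # "questioner" or "solver"
    content: str


@dataclass
class Sample:
    source: str
    translation: Optional[str] = None
    dialogue: List[Turn] = field(default_factory=list)
    verified: bool = False


# Hypothetical interfaces; each would wrap an LLM or a toolchain in practice.
Questioner = Callable[[str, str], str]        # (source, feedback) -> request
Solver = Callable[[str, str], str]            # (source, request) -> candidate
Verifier = Callable[[str], Tuple[bool, str]]  # candidate -> (ok, feedback)


def generate_sample(source: str, questioner: Questioner, solver: Solver,
                    verify: Verifier, max_turns: int = 4) -> Sample:
    """Run the Questioner-Solver loop until verification passes or turns run out."""
    sample = Sample(source=source)
    feedback = ""
    for _ in range(max_turns):
        request = questioner(source, feedback)
        sample.dialogue.append(Turn("questioner", request))
        candidate = solver(source, request)
        sample.dialogue.append(Turn("solver", candidate))
        ok, feedback = verify(candidate)
        if ok:
            sample.translation = candidate
            sample.verified = True
            break
    return sample


# Stubs standing in for the two LLMs and the compiler, for illustration only.
def demo_questioner(source: str, feedback: str) -> str:
    if not feedback:
        return "Translate this Fortran function to C++."
    return "Fix the translation. Tool output: " + feedback


def demo_solver(source: str, request: str) -> str:
    if "Tool output" in request:
        return "int add(int a, int b) { return a + b; }"
    return "int add(int a, int b) { return a + b }"  # deliberately broken


def demo_verify(candidate: str) -> Tuple[bool, str]:
    if "a + b;" in candidate:
        return True, ""
    return False, "error: expected ';' before '}'"


sample = generate_sample("integer function add(a, b) ...",
                         demo_questioner, demo_solver, demo_verify)
```

With these stubs the first candidate fails verification and the second passes, so `sample.dialogue` holds four turns and `sample.verified` is true. Samples that never verify within `max_turns` keep `verified=False` and could be filtered out before fine-tuning.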

Best for: AI Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.