Beyond Code Pairs: Dialogue-Based Data Generation for LLM Code Translation

2025-10-07 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

A novel automated dataset generation pipeline, featuring a dual-LLM Questioner–Solver design, significantly enhances code translation capabilities for low-resource programming domains like Fortran and CUDA. This pipeline integrates external knowledge from compilers and runtime feedback to generate not only verified source–target code pairs with unit tests but also multi-turn dialogues capturing the translation reasoning process. Applied to Fortran→C++ and C++→CUDA, it produced 3.64k and 3.93k dialogues respectively. Fine-tuning on this data dramatically improved functional correctness, boosting unit test success rates by over 56% on the challenging C++-to-CUDA task. A 7B open-weight model fine-tuned with this data even outperformed larger proprietary systems on key metrics like compilation success.

Key takeaway

For Machine Learning Engineers developing code translation solutions, consider adopting dialogue-based data generation to significantly improve LLM performance in low-resource languages like Fortran and specialized frameworks such as CUDA. Fine-tune models on multi-turn dialogues or Question-Solution pairs, selecting the data granularity that best suits the task's syntactic or semantic complexity, to achieve higher functional correctness and potentially outperform larger proprietary models.

Key insights

A dual-LLM Questioner–Solver generates dialogue-based data, improving LLM code translation in low-resource domains.

Principles

Integrating external feedback enhances LLM reasoning.
Dialogue data captures iterative refinement processes.
Data granularity impacts task performance.

Method

A dual-LLM Questioner–Solver pipeline generates multi-turn dialogues, unit tests, and verified translations by integrating compiler and runtime feedback for iterative refinement.

In practice

Fine-tune LLMs with dialogue data for Fortran→C++ and C++→CUDA.
Use QS-Pair data for syntactically complex tasks.
Employ full dialogue traces for semantic correctness.

Topics

Code Translation
Large Language Models
Data Generation
Fortran
C++
CUDA

Best for: AI Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.