Reward-Free Code Alignment from Pretrained or Fine-Tuned LLM: Unpacking the Trade-offs for Code Generation

Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, extended

Summary

This study investigates the trade-offs of applying two reward-free LLM alignment techniques, Direct Preference Optimization (DPO) and BoNBoN, to code generation models. It covers five LLMs (Meta-Llama-3-8B, Qwen2.5-Coder-7B, CodeLlama-7b, deepseek-coder-1.3b, deepseek-coder-7b) and their instruction-tuned variants, and compares alignment pathways that start from either a pretrained or a fine-tuned model. Functional requirements were assessed with HumanEval+, MBPP+, EvalPerf, and EvoEval, while non-functional requirements (readability, style, maintainability) were measured with the CODAL benchmark. Pretrained-to-aligned pathways yield larger relative improvements, such as +75% in non-functional performance for CodeLlama-7b and +42% in functional performance for Meta-Llama-3-8B, albeit from lower baselines. Fine-tuned-to-aligned pathways generally offer smaller gains and in some cases degrade performance. Non-functional requirements consistently improve more than functional ones under alignment.

Key takeaway

If you are aligning large language models for code generation, prioritize non-functional requirements such as readability, style, and maintainability, as they show more consistent improvement. To maximize relative performance gains, start from pretrained models, but be aware of their lower absolute baselines. If high absolute performance is critical, fine-tuned-to-aligned pathways are generally the better choice, but monitor carefully for degradation after the alignment step. Empirically validate your chosen alignment technique for your specific model and objective.
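
The distinction between relative and absolute gains can be made concrete with a small calculation. The sketch below is illustrative only: the function name and all scores are hypothetical and are not taken from the study. It shows how a pathway with the larger relative gain can still finish with the lower absolute score.

```python
def relative_gain(before: float, after: float) -> float:
    """Return the improvement as a fraction of the starting score."""
    if before <= 0:
        raise ValueError("baseline score must be positive")
    return (after - before) / before


# Hypothetical benchmark scores on a 0-100 scale (NOT results from the paper).
pathways = {
    "pretrained -> aligned": (20.0, 30.0),
    "fine-tuned -> aligned": (50.0, 55.0),
}

for name, (before, after) in pathways.items():
    gain = relative_gain(before, after)
    print(f"{name}: {before:.0f} -> {after:.0f} ({gain:+.0%} relative)")
```

With these made-up numbers, the pretrained pathway gains far more in relative terms yet still ends below the fine-tuned pathway in absolute terms, which is why both figures are worth reporting.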

Key insights

Reward-free alignment significantly improves the non-functional properties of code LLMs; pretrained models show higher relative gains, while fine-tuned models offer better absolute performance.

Method

The study used a modified SelfCodeAlign pipeline to generate preference datasets for DPO and BoNBoN. Models underwent supervised fine-tuning for 120k steps, followed by DPO or BoNBoN training for another 120k steps.
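
For readers unfamiliar with DPO, the following is a minimal sketch of the standard per-example DPO objective as it is usually stated: the negative log-sigmoid of a scaled difference between policy and reference log-probability margins. It is not the study's training code. The function names, the `beta` value, and the log-probabilities are assumptions chosen for illustration.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; the summary does not state the one used
) -> float:
    """Per-example DPO loss computed from sequence log-probabilities."""
    # How much more the policy likes each response than the frozen reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) equals softplus(-z).
    return softplus(-logits)


# Hypothetical log-probabilities for one (chosen, rejected) code completion pair.
untrained = dpo_loss(-5.0, -7.0, -5.0, -7.0)  # policy identical to reference
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)   # policy shifted toward the chosen sample
print(f"loss at initialization: {untrained:.4f}")
print(f"loss after shifting toward the chosen sample: {improved:.4f}")
```

When the policy matches the reference the loss is log 2, and it falls as the policy raises the chosen completion's likelihood relative to the rejected one. No reward model appears anywhere in the computation, which is what "reward-free" refers to.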

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.