Reward-Free Code Alignment from Pretrained or Fine-Tuned LLM: Unpacking the Trade-offs for Code Generation
Summary
This study investigates the trade-offs of applying two reward-free LLM alignment techniques, Direct Preference Optimization (DPO) and BoNBoN, to code generation models. Working with five state-of-the-art LLMs (Meta-Llama-3-8B, Qwen2.5-Coder-7B, CodeLlama-7b, deepseek-coder-1.3b, deepseek-coder-7b) and their instruction-tuned variants, the authors evaluate alignment pathways that start from either a pretrained or a finetuned model. Functional requirements were assessed with HumanEval+, MBPP+, EvalPerf, and EvoEval, while non-functional requirements (readability, style, maintainability) were measured with the CODAL benchmark. The findings indicate that pretrained-to-aligned pathways yield larger relative improvements, such as +75% in non-functional performance for CodeLlama-7b and +42% in functional performance for Meta-Llama-3-8B, albeit from lower baselines. Finetuned-to-aligned pathways generally offer smaller gains and sometimes degrade performance. Non-functional requirements consistently improve more than functional ones under alignment.
Key takeaway
If you are aligning Large Language Models for code generation, prioritize non-functional requirements such as readability and style, as they show more consistent improvement. To maximize relative performance gains, start with pretrained models, but be aware of their lower absolute baselines. If high absolute performance is critical, finetuned-to-aligned pathways are generally better, but monitor carefully for degradation during supervised fine-tuning. Empirically validate your chosen alignment technique for your specific model and objective.
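The distinction between relative and absolute gains is easy to blur, so a small arithmetic sketch may help. The scores below are hypothetical and are not taken from the study; `relative_gain` is an illustrative helper name. The example shows how a model starting from a low baseline can post a much larger relative gain while still ending below a model that started higher.

```python
def relative_gain(baseline: float, aligned: float) -> float:
    """Relative improvement of the aligned score over the baseline, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (aligned - baseline) / baseline * 100.0


# Hypothetical benchmark scores (NOT results from the study).
pathways = {
    "pretrained-to-aligned": {"baseline": 20.0, "aligned": 30.0},
    "finetuned-to-aligned": {"baseline": 50.0, "aligned": 55.0},
}

for name, scores in pathways.items():
    gain = relative_gain(scores["baseline"], scores["aligned"])
    print(f"{name}: relative gain {gain:.1f}%, final score {scores['aligned']:.1f}")
```

With these made-up numbers, the pretrained pathway shows the larger relative gain, yet the finetuned pathway still reaches the higher final score, which is the trade-off the takeaway describes.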
Key insights
Reward-free alignment significantly improves the non-functional properties of code LLMs; pretrained models show higher relative gains, while finetuned models reach better absolute performance.
Principles
- Pretrained models offer higher plasticity for alignment.
- Non-functional code alignment is more consistent.
- Baseline performance dictates alignment viability.
Method
The study used a modified SelfCodeAlign pipeline to generate preference datasets for DPO and BoNBoN. Models underwent supervised fine-tuning for 120k steps, followed by DPO or BoNBoN training for another 120k steps.
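For readers unfamiliar with DPO, the following is a minimal sketch of the standard per-pair DPO objective in plain Python. It is not the study's training code. The function name, argument names, and the `beta` value of 0.1 are assumptions chosen for illustration, and the inputs are assumed to be summed token log-probabilities of a preferred ("chosen") and a dispreferred ("rejected") completion under the policy being trained and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    # How much more (or less) likely each completion is under the policy than the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow for large |m|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities for one preference pair.
# When the policy equals the reference, the margin is 0 and the loss is log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Raising the chosen completion's likelihood relative to the reference lowers the loss.
print(dpo_loss(-10.0, -15.0, -12.0, -15.0))
```

No reward model appears anywhere in this objective, which is what makes the technique "reward-free": the preference pairs alone drive the update.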
In practice
- Prioritize non-functional alignment objectives.
- Select code-specialized LLMs for alignment.
- Monitor SFT-stage degradation as a risk indicator.
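One way to act on the last point is to compare benchmark scores before and after supervised fine-tuning and flag any drop before investing in the alignment stage. The sketch below assumes you already have per-benchmark scores as dictionaries; the helper name `sft_degradations`, the `tolerance` parameter, and all scores are hypothetical and do not come from the study.

```python
def sft_degradations(base_scores: dict, sft_scores: dict, tolerance: float = 0.0) -> dict:
    """Return {benchmark: score change} for benchmarks where SFT dropped by more than tolerance."""
    drops = {}
    for benchmark, base in base_scores.items():
        if benchmark not in sft_scores:
            continue  # skip benchmarks that were not re-evaluated after SFT
        delta = sft_scores[benchmark] - base
        if delta < -tolerance:
            drops[benchmark] = delta
    return drops


# Hypothetical scores (NOT results from the study).
base = {"HumanEval+": 40.0, "MBPP+": 50.0}
after_sft = {"HumanEval+": 37.5, "MBPP+": 52.0}

for benchmark, delta in sft_degradations(base, after_sft).items():
    print(f"Warning: {benchmark} changed by {delta:+.1f} points after SFT")
```

A non-empty result is only a signal to investigate, not proof that the later alignment stage will fail.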
Topics
- LLM Alignment
- Code Generation
- Direct Preference Optimization
- BoNBoN
- Non-functional Code Quality
- Pretrained LLMs
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.