AlignCoder: Aligning Retrieval with Target Intent for Repository-Level Code Completion

2023-11-03 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, long

Summary

AlignCoder is a novel repository-level code completion framework designed to overcome limitations of existing code large language models (LLMs) and retrieval-augmented generation (RAG) approaches. Existing methods struggle with repository-specific context: the retrieval query is often semantically misaligned with the target code, and retrievers make little use of inference information. AlignCoder addresses both problems. First, a query enhancement mechanism generates multiple candidate completions and combines them into an "enhanced query," narrowing the semantic gap between the initial query and the target code. Second, reinforcement learning is used to train an "AlignRetriever" that can exploit the inference information carried by the enhanced query for more accurate retrieval. Evaluated with five backbone code LLMs on the CrossCodeEval and RepoEval benchmarks, AlignCoder improves the exact match (EM) score on CrossCodeEval by 18.1%, indicating strong performance and generalizability across backbones.

Key takeaway

ML engineers building repository-level code completion systems should re-evaluate traditional RAG approaches. AlignCoder shows that enhancing queries with multiple candidate completions and training the retriever with reinforcement learning improves accuracy, with an 18.1% gain in EM score on CrossCodeEval. The method reduces semantic misalignment between query and target and makes better use of inference information, which makes code LLMs more effective in complex, repository-specific contexts.

Key insights

AlignCoder enhances RAG for repository-level code completion by using multiple candidate completions to refine queries and training a retriever with reinforcement learning.

Principles

Semantic misalignment between query and target code reduces retrieval accuracy.
Sampling multiple candidate completions significantly increases the likelihood that at least one of them is correct.
Reinforcement learning can train retrievers to leverage inference information.
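
To make the second principle concrete, here is a minimal sketch of why drawing several samples helps. It assumes, purely for illustration, that each sampled completion is correct independently with the same probability p; this independence assumption and the function name are not from the paper, and real samples from one model are correlated, so the gain in practice is smaller.

```python
def prob_at_least_one_correct(p: float, k: int) -> float:
    """Probability that at least one of k samples is correct.

    Assumption (illustrative only): each sample is correct independently
    with probability p, so all k fail with probability (1 - p) ** k.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    return 1.0 - (1.0 - p) ** k


# Hypothetical per-sample accuracy of 20%, with 1 and 5 samples.
single = prob_at_least_one_correct(0.2, 1)
five = prob_at_least_one_correct(0.2, 5)
```

Under this simplified model the chance of having a correct candidate in the pool rises quickly with the number of samples, which is the intuition behind building the enhanced query from several candidates rather than one.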

Method

AlignCoder employs BM25 for initial retrieval, then samples multiple candidate completions to construct an enhanced query. An AlignRetriever is trained using reinforcement learning, optimizing a reward function based on target code perplexity.
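
The query enhancement step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the BM25 scorer is a small self-written variant, `build_enhanced_query` is a hypothetical helper, the candidate completions are hard-coded stand-ins for samples that would come from a code LLM, and BM25 is reused for the second retrieval only to show the lexical effect (in AlignCoder the enhanced query is consumed by the trained AlignRetriever).

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase identifier-like tokens."""
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text.lower())


class BM25:
    """Minimal BM25 index over a list of code snippets."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(d) for d in docs]
        self.doc_freqs = [Counter(t) for t in self.doc_tokens]
        self.avg_len = sum(len(t) for t in self.doc_tokens) / max(len(docs), 1)
        df = Counter()
        for toks in self.doc_tokens:
            df.update(set(toks))  # document frequency per term
        n = len(docs)
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query_tokens, i):
        """BM25 score of document i for the given query tokens."""
        freqs = self.doc_freqs[i]
        length = len(self.doc_tokens[i])
        total = 0.0
        for t in query_tokens:
            f = freqs.get(t, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += self.idf[t] * f * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k=1):
        """Indices of the k best-scoring documents (ties keep input order)."""
        q = tokenize(query)
        ranked = sorted(range(len(self.doc_tokens)),
                        key=lambda i: self.score(q, i), reverse=True)
        return ranked[:k]


def build_enhanced_query(unfinished_code, candidates):
    """Hypothetical helper: append sampled candidates to the unfinished code."""
    return "\n".join([unfinished_code, *candidates])


# Toy repository snippets (illustrative only).
snippets = [
    "def load_config(path): return json.load(open(path))",
    "class UserRepository:\n    def find_by_email(self, email): ...",
    "def render_template(name, context): ...",
]
index = BM25(snippets)

unfinished = "user = repo."
# Stand-ins for completions sampled from a code LLM.
candidates = ["find_by_email(email)", "get_user(email)"]
enhanced = build_enhanced_query(unfinished, candidates)

plain_score = index.score(tokenize(unfinished), 1)
enhanced_score = index.score(tokenize(enhanced), 1)
best = index.top_k(enhanced, k=1)
```

In this toy case the unfinished code shares no tokens with the `UserRepository` snippet, so the plain query cannot rank it, while the enhanced query picks it up through identifiers that appear in the candidates. Even an incorrect candidate such as `get_user(email)` contributes a useful token.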

In practice

Use multiple candidate completions to create semantically richer retrieval queries.
Train custom retrievers with reinforcement learning, using perplexity as a reward signal.
Construct retrieval codebases from both base and dependency code snippets.
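
A perplexity-based reward signal could look like the sketch below. The summary above only states that the reward is based on the perplexity of the target code; the specific form used here (baseline perplexity minus perplexity with the retrieved snippet in context) is an assumption for illustration, as are the function names. The token log-probabilities are hard-coded; in a real system they would come from the code LLM scoring the target tokens.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities of the target code."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def perplexity_reward(logprobs_with_snippet, logprobs_baseline):
    """Hypothetical reward: how much a retrieved snippet lowers target perplexity.

    Positive when the snippet makes the target code more predictable
    than the baseline context does. The paper's exact formulation may differ.
    """
    return perplexity(logprobs_baseline) - perplexity(logprobs_with_snippet)


# Illustrative log-probabilities for a two-token target.
baseline = [-2.0, -2.0]       # target tokens scored without the snippet
with_snippet = [-0.5, -0.5]   # same tokens scored with the snippet in context
reward = perplexity_reward(with_snippet, baseline)
```

A reward of this shape gives the retriever credit for snippets that make the ground-truth completion easier for the generator to predict, rather than for snippets that merely look similar to the query.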

Topics

AlignCoder
Repository-level Code Completion
Retrieval-Augmented Generation
Code LLMs
Reinforcement Learning
Query Enhancement

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.