Hierarchical Embedding Fusion for Retrieval-Augmented Code Generation

2026-03-10 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, extended

Summary

Hierarchical Embedding Fusion (HEF) is a two-stage system for low-latency, repository-aware code completion. It targets two problems in traditional retrieval-augmented generation: online cost that grows with repository size, and noise from long retrieved contexts. Offline, a small 0.5B-parameter fuser model compresses repository chunks into a hierarchical cache of dense vectors. Online, HEF retrieves a fixed number of these vectors and maps them into learned pseudo-tokens that condition a 1.3B-parameter code generator. Replacing thousands of raw retrieved tokens with a bounded pseudo-token budget yields sub-second median latency (0.68s on a single A100) with exact-match accuracy comparable to snippet-based baselines on RepoBench and RepoEval. Compared with graph- and iterative-retrieval systems, HEF reduces median end-to-end latency by 13x–26x.

Key takeaway

For AI engineers building real-time code completion tools, HEF offers a practical recipe for integrating repository-level context without incurring high latency. Consider a hierarchical dense cache with a pseudo-token interface when you need sub-second response times and responsiveness matters more than the last increment of accuracy. Because online cost is tied to a fixed pseudo-token budget rather than to repository size, the approach scales to large codebases.

Key insights

Hierarchical dense caching with pseudo-tokens enables fast, repository-aware code completion by decoupling online cost from repository size.

Principles

Compress context offline into a reusable hierarchy.
Use pseudo-tokens to bound online prompt length.
Joint end-to-end training improves alignment.
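
The first principle, compressing context offline into a reusable hierarchy, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the three levels (chunk, file, directory), the function names, and above all the use of mean pooling, which stands in for the 0.5B fuser model that HEF actually uses to produce its vectors.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors.

    Assumption: a crude stand-in for the learned fuser model.
    """
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def build_hierarchical_cache(chunk_vectors):
    """Build a cache keyed by (level, name) from chunk embeddings.

    chunk_vectors maps (file_path, chunk_index) -> embedding.
    The chunk/file/directory levels are hypothetical; the point is that
    the cache is computed once offline and reused for every query.
    """
    cache = {}
    by_file = defaultdict(list)
    for (path, index), vector in sorted(chunk_vectors.items()):
        cache[("chunk", f"{path}#{index}")] = vector
        by_file[path].append(vector)

    by_dir = defaultdict(list)
    for path, vectors in by_file.items():
        file_vector = mean_pool(vectors)
        cache[("file", path)] = file_vector
        by_dir[str(PurePosixPath(path).parent)].append(file_vector)

    for directory, vectors in by_dir.items():
        cache[("dir", directory)] = mean_pool(vectors)
    return cache


# Toy 2-dimensional embeddings for three chunks in two files.
chunks = {
    ("src/a.py", 0): [1.0, 0.0],
    ("src/a.py", 1): [0.0, 1.0],
    ("src/b.py", 0): [1.0, 1.0],
}
cache = build_hierarchical_cache(chunks)
```

In this toy setup the cache holds six entries (three chunks, two files, one directory), and coarser entries summarize finer ones, so a query can be matched at whichever granularity is most useful.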

Method

HEF constructs an offline hierarchical cache of dense vectors from code chunks using a fuser model. Online, it retrieves top-K vectors and projects them into pseudo-tokens to condition a code generator.
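
The online step described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: cosine similarity as the retrieval score, a single linear projection producing one pseudo-token per retrieved vector, and all function names are hypothetical. In a trained system the projection weights would be learned rather than fixed by hand.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(query, cache_vectors, k):
    """Return the k cached vectors most similar to the query."""
    ranked = sorted(cache_vectors, key=lambda v: cosine(query, v), reverse=True)
    return ranked[:k]


def project(vector, weight):
    """Map a cache vector into the generator's embedding space.

    weight has shape (d_model, d_cache). Assumption: one linear layer,
    one pseudo-token per retrieved vector.
    """
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]


def build_generator_input(query, cache_vectors, weight, prompt_embeddings, k):
    """Prepend k pseudo-tokens to the embedded prompt."""
    retrieved = retrieve_top_k(query, cache_vectors, k)
    pseudo_tokens = [project(v, weight) for v in retrieved]
    return pseudo_tokens + prompt_embeddings


# Toy example: 2-dimensional cache vectors, 3-dimensional generator embeddings.
cache_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prompt_embeddings = [[0.0, 0.0, 0.0] for _ in range(5)]
inputs = build_generator_input([1.0, 0.0], cache_vectors, weight, prompt_embeddings, k=2)
```

The generator input here has k + 5 = 7 positions no matter how many vectors the cache holds, which is the sense in which a fixed pseudo-token budget bounds online prompt length.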

In practice

Filter training contexts using utility-weighted likelihood.
30-40 pseudo-tokens capture most repository information.
A small fuser model (0.5B) is sufficient for compression.

Topics

Hierarchical Embedding Fusion
Retrieval-Augmented Code Generation
Code Completion
Pseudo-Token Conditioning
Dense Retrieval

Best for: AI Engineer, AI Scientist, Research Scientist, AI Researcher, Machine Learning Engineer, Deep Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.