Hydra: Unifying Document Retrieval and Generation in a Single Vision-Language Model

2026-04-16 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

Hydra is a dual-head architecture that unifies document retrieval and autoregressive generation within a single vision-language model (VLM), Qwen3.5-4B. It uses one Low-Rank Adaptation (LoRA) adapter, trained exclusively for ColBERT-style late-interaction retrieval, that is toggled at inference time. Enabling the adapter puts the model in retrieval mode, where it produces multi-vector embeddings; disabling it restores the base model's generation behavior, with outputs byte-identical to those of an independent base-model pipeline in 100% of samples. The design reduces peak GPU memory by 41% compared to dual-model setups. Reliable generation depends on three engineering requirements: attention-mode restoration, lm_head preservation, and KV-cache-aware decoding. A proof of concept extends Hydra to Qwen2.5-Omni-3B, demonstrating generalization to audio retrieval and speech generation without additional training.
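
For readers unfamiliar with ColBERT-style late interaction, the scoring rule can be sketched in a few lines: each query token embedding is matched to its most similar document token embedding, and those maxima are summed. This is a minimal illustration on toy 2-dimensional vectors, not Hydra's code; the names `dot` and `maxsim_score` are hypothetical, and real systems use normalized, higher-dimensional embeddings.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """Late-interaction score: for each query vector, take the best
    similarity against any document vector, then sum over query vectors."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy multi-vector embeddings (illustrative values only).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # matches both query tokens exactly
doc_b = [[0.5, 0.5], [0.5, 0.5]]              # only partial matches

score_a = maxsim_score(query, doc_a)  # 1.0 + 1.0 = 2.0
score_b = maxsim_score(query, doc_b)  # 0.5 + 0.5 = 1.0
```

Because the document is represented by many vectors rather than one, a document ranks highly when every query token finds a close match somewhere in it.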

Key takeaway

For MLOps engineers deploying document AI systems, Hydra reduces GPU memory footprint and system complexity. A single VLM with a toggled LoRA adapter handles both high-quality retrieval and generation, eliminating the need for separate models. Watch for throughput overhead from adapter switching under concurrent serving, and implement attention-mode restoration and lm_head preservation correctly to keep generation stable.

Key insights

A single retrieval-trained LoRA adapter can unify document retrieval and generation in one VLM.

Principles

LoRA's additive structure enables exact base model recovery.
Retrieval-only LoRA training is sufficient for dual-head VLMs.
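
The first principle can be illustrated with a toy linear layer. A LoRA layer computes the frozen base projection plus a low-rank update, so skipping the update returns the base output exactly, provided the adapter is kept unmerged and the base weights are never written to. The sketch below assumes that setup; `LoRALinear` and `matvec` are hypothetical names, not identifiers from the Hydra repository.

```python
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, x: List[float]) -> List[float]:
    """Multiply matrix m (rows x cols) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in m]


class LoRALinear:
    """Toy linear layer: y = W x + scale * B (A x), with W frozen."""

    def __init__(self, weight: Matrix, lora_a: Matrix, lora_b: Matrix, scale: float = 1.0):
        self.weight = weight      # frozen base weight, shape (out, in)
        self.lora_a = lora_a      # down-projection, shape (r, in)
        self.lora_b = lora_b      # up-projection, shape (out, r)
        self.scale = scale
        self.adapter_enabled = True

    def forward(self, x: List[float]) -> List[float]:
        base = matvec(self.weight, x)
        if not self.adapter_enabled:
            # Base weights were never modified, so this is the base output.
            return base
        delta = matvec(self.lora_b, matvec(self.lora_a, x))
        return [b + self.scale * d for b, d in zip(base, delta)]


layer = LoRALinear(
    weight=[[1.0, 2.0], [3.0, 4.0]],
    lora_a=[[0.5, -0.5]],        # rank r = 1
    lora_b=[[1.0], [2.0]],
)
x = [1.0, 3.0]

retrieval_out = layer.forward(x)      # adapter on:  [6.0, 13.0]
layer.adapter_enabled = False
generation_out = layer.forward(x)     # adapter off: [7.0, 15.0], same as W x
```

Merging the update into the weights and subtracting it later would not offer the same guarantee, since floating-point addition and subtraction are not exactly invertible in general.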

Method

Train a single LoRA adapter for retrieval, then toggle it at inference to switch between retrieval (LoRA-on, bidirectional attention) and generation (LoRA-off, causal attention) modes, ensuring lm_head preservation and KV-cache support.
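
The mode switch described above pairs two pieces of state: whether the adapter is active and which attention mask is used. One way to keep them in sync, and to guarantee that causal attention is restored after retrieval even if an error occurs, is a context manager. This is a sketch under assumptions; `DualHeadState`, `retrieval_mode`, and `build_mask` are hypothetical names, and the source does not say how Hydra implements the switch.

```python
from contextlib import contextmanager
from typing import Iterator, List


class DualHeadState:
    """Hypothetical holder for the two toggles; defaults to generation mode."""

    def __init__(self) -> None:
        self.adapter_enabled = False
        self.attention_mode = "causal"


@contextmanager
def retrieval_mode(state: DualHeadState) -> Iterator[DualHeadState]:
    """Enable LoRA and bidirectional attention, then restore the prior state."""
    previous = (state.adapter_enabled, state.attention_mode)
    state.adapter_enabled = True
    state.attention_mode = "bidirectional"
    try:
        yield state
    finally:
        # Runs on normal exit and on exceptions, so generation never
        # inherits a bidirectional mask by accident.
        state.adapter_enabled, state.attention_mode = previous


def build_mask(seq_len: int, mode: str) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j."""
    if mode == "bidirectional":
        return [[True] * seq_len for _ in range(seq_len)]
    if mode == "causal":
        return [[j <= i for j in range(seq_len)] for i in range(seq_len)]
    raise ValueError(f"unknown attention mode: {mode!r}")


state = DualHeadState()
with retrieval_mode(state):
    embed_mask = build_mask(3, state.attention_mode)   # every token sees every token
decode_mask = build_mask(3, state.attention_mode)      # lower-triangular again
```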

In practice

Cut peak GPU memory by 41% relative to a dual-model setup by serving RAG from a single VLM.
Implement attention-mode restoration for reliable generation.
Preserve the base model's lm_head to prevent corruption.
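
The source does not describe how lm_head preservation is implemented. One generic way to check it, shown here purely as an assumption, is to fingerprint the output-head weights before retrieval training or adapter loading and compare afterwards. `weights_fingerprint` is a hypothetical helper, and the nested lists stand in for a real weight tensor.

```python
import hashlib
import struct
from typing import List


def weights_fingerprint(rows: List[List[float]]) -> str:
    """SHA-256 over the exact 64-bit float bytes of every weight."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(struct.pack(f"<{len(row)}d", *row))
    return digest.hexdigest()


# Stand-in for the base model's lm_head weights (illustrative values only).
lm_head = [[0.1, 0.2], [0.3, 0.4]]
before = weights_fingerprint(lm_head)

# ... retrieval training or adapter loading would happen here ...

after = weights_fingerprint(lm_head)
if before != after:
    raise RuntimeError("lm_head changed; generation output may be corrupted")

# A single altered weight changes the fingerprint.
corrupted = [[0.1, 0.2], [0.3, 0.5]]
corruption_detected = weights_fingerprint(corrupted) != before
```

A byte-level hash is stricter than a tolerance-based comparison, which fits the goal of byte-identical generation.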

Topics

Hydra Architecture
ColBERT Retrieval
Autoregressive Generation
LoRA Adapters
GPU Memory Optimization

Code references

athrael-soju/hydra

Best for: MLOps Engineer, AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.