Hydra: Unifying Document Retrieval and Generation in a Single Vision-Language Model
Summary
Hydra is a dual-head architecture that unifies document retrieval and autoregressive generation within a single vision-language model (VLM), specifically Qwen3.5-4B. It uses a single Low-Rank Adaptation (LoRA) adapter, trained exclusively for ColBERT-style late-interaction retrieval, that is toggled at inference time. With the adapter enabled, the model runs in retrieval mode and produces multi-vector embeddings. With it disabled, the base model's original generation quality is restored, yielding byte-identical outputs in 100% of samples compared to an independent base-model pipeline. This design reduces peak GPU memory by 41% compared to dual-model setups. Reliable generation depends on three engineering requirements: attention-mode restoration, lm_head preservation, and KV-cache-aware decoding. A proof-of-concept extends Hydra to Qwen2.5-Omni-3B, demonstrating generalization to audio retrieval and speech generation without additional training.
Key takeaway
For MLOps engineers deploying document AI systems, Hydra offers a way to reduce GPU memory footprint and system complexity. A single VLM with a toggled LoRA adapter provides both high-quality retrieval and generation, eliminating the need for separate models. Be mindful of potential throughput overhead from adapter switching during concurrent serving, and implement attention-mode restoration and lm_head preservation correctly so that generation stays stable.
Key insights
A single retrieval-trained LoRA adapter can unify document retrieval and generation in one VLM.
Principles
- LoRA's additive structure enables exact base model recovery.
- Retrieval-only LoRA training is sufficient for dual-head VLMs.
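The first principle can be illustrated with a toy layer. This is a minimal sketch, not the Hydra implementation: the class `LoRALinear`, the helpers `matmul` and `add`, and the `scale` parameter are hypothetical names, and plain Python lists stand in for tensors. It assumes the adapter is kept as a separate additive branch rather than merged into the base weight, so switching it off leaves only the untouched base computation.

```python
# Toy illustration only: plain-Python matrices, hypothetical names.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

class LoRALinear:
    """Computes x W + scale * (x A) B when enabled, and x W when disabled."""

    def __init__(self, weight, lora_a, lora_b, scale=1.0):
        self.weight = weight    # frozen base weight, never modified
        self.lora_a = lora_a    # low-rank down-projection (d_in x r)
        self.lora_b = lora_b    # low-rank up-projection (r x d_out)
        self.scale = scale
        self.enabled = True

    def forward(self, x):
        base = matmul(x, self.weight)
        if not self.enabled:
            # The adapter branch is skipped entirely, so the result is the
            # same computation the base model would perform on its own.
            return base
        delta = matmul(matmul(x, self.lora_a), self.lora_b)
        return add(base, [[self.scale * v for v in row] for row in delta])

W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5], [-0.5]]
B = [[1.0, 1.0]]
x = [[1.0, 2.0]]

layer = LoRALinear(W, A, B)
retrieval_out = layer.forward(x)    # adapter on
layer.enabled = False
generation_out = layer.forward(x)   # adapter off: base path only
```

Because the base weight is never overwritten, the disabled path involves no floating-point round trip. Merging the adapter into the weight and later subtracting it would not carry the same guarantee.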
Method
Train a single LoRA adapter for retrieval, then toggle it at inference to switch between retrieval (LoRA-on, bidirectional attention) and generation (LoRA-off, causal attention) modes, ensuring lm_head preservation and KV-cache support.
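The two modes differ in attention pattern and in how outputs are consumed. The sketch below shows the standard ColBERT-style MaxSim score over multi-vector embeddings and the difference between a bidirectional and a causal attention mask. The function names `maxsim_score` and `attention_mask` and the toy vectors are assumptions for illustration; the summary does not describe Hydra's actual scoring code.

```python
# Toy illustration only: hypothetical helper names, hand-made vectors.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs, doc_vecs):
    """Late interaction: for each query token embedding, take its best
    match among the document token embeddings, then sum those maxima."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

def attention_mask(n, causal):
    """mask[i][j] is True when position i may attend to position j.
    Causal: only j <= i (generation). Bidirectional: all j (retrieval)."""
    return [[(j <= i) if causal else True for j in range(n)] for i in range(n)]

query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.5, 0.5]]
doc_b = [[0.0, 1.0], [0.0, 1.0]]

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
ranked = sorted([("doc_a", score_a), ("doc_b", score_b)], key=lambda p: -p[1])

causal = attention_mask(3, causal=True)
bidirectional = attention_mask(3, causal=False)
```

Here `doc_a` ranks first because it matches both query tokens at least partially, while `doc_b` matches only the second one.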
In practice
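- Reduce peak GPU memory by 41% relative to a dual-model setup by serving both retrieval and generation for RAG from a single VLM.
- Restore causal attention after retrieval so that generation remains reliable.
- Preserve the base model's lm_head to prevent corrupted generation.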
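One way to make the restoration and preservation points hard to forget is to scope retrieval mode with a context manager. The following is a sketch under stated assumptions: `ToyDualHeadModel`, its fields, and `retrieval_mode` are hypothetical stand-ins, not part of Hydra or of any library, and the lm_head check is a simple equality test on a toy list.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field

@dataclass
class ToyDualHeadModel:
    # Hypothetical flags standing in for real model state.
    adapter_enabled: bool = False    # LoRA off by default (generation)
    causal_attention: bool = True    # causal by default (generation)
    lm_head: list = field(default_factory=lambda: [0.1, 0.2, 0.3])

@contextmanager
def retrieval_mode(model):
    """Enable the adapter and bidirectional attention, then always restore
    the previous settings, even if the retrieval step raises."""
    saved = (model.adapter_enabled, model.causal_attention)
    lm_head_snapshot = list(model.lm_head)
    model.adapter_enabled = True
    model.causal_attention = False
    try:
        yield model
    finally:
        model.adapter_enabled, model.causal_attention = saved
        # Guard against anything having overwritten the generation head.
        if model.lm_head != lm_head_snapshot:
            raise RuntimeError("lm_head changed during retrieval mode")

model = ToyDualHeadModel()

with retrieval_mode(model) as m:
    inside = (m.adapter_enabled, m.causal_attention)

after = (model.adapter_enabled, model.causal_attention)

# State is also restored when the retrieval step fails.
try:
    with retrieval_mode(model):
        raise ValueError("simulated retrieval failure")
except ValueError:
    pass
after_error = (model.adapter_enabled, model.causal_attention)
```

Scoping the switch this way keeps a failed retrieval call from leaving the model in bidirectional mode for the next generation request.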
Topics
- Hydra Architecture
- ColBERT Retrieval
- Autoregressive Generation
- LoRA Adapters
- GPU Memory Optimization
Best for: MLOps Engineer, AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.