Is BM25 Enough for Agentic Deep Research?, Replacing Decoders with MLPs in Generative Recommendation, and More!
Summary
This week's intelligence brief highlights ten research papers and two benchmarks in information retrieval and recommender systems. Key advancements include PI-SERINI, a minimal search agent that pairs BM25 lexical retrieval with capable LLMs for deep research, achieving 83.1% accuracy on BrowseComp-Plus while cutting costs 3.3x to 10x. Jina AI introduced jina-embeddings-v5-omni, multimodal embedding models (0.95B "nano", 1.57B "small") that extend text-only backbones to images, video, and audio without retraining. Guo et al. presented SID-MLP, a distillation framework that replaces Transformer decoders with MLPs for generative recommendation, yielding an 8.74x throughput speedup and 95.7% peak-memory reduction. IBM Research released Granite Embedding Multilingual R2 models, supporting 200+ languages with a 32,768-token context window. New benchmarks, ASTRA-QA and TRIVIA+, address abstract question answering and LLM hallucination detection, respectively.
Key takeaway
For AI architects evaluating retrieval-augmented generation (RAG) systems: simple lexical retrieval (BM25) paired with a capable LLM can reach high accuracy at significantly lower cost, which challenges the assumption that complex dense retrievers are necessary. Teams should evaluate agentic search frameworks such as PI-SERINI to reduce inference cost without giving up accuracy, especially when searching large document collections.
Key insights
Lexical retrieval paired with capable LLMs can outperform dense retrievers, and frozen text backbones can be extended to produce multimodal embeddings.
Principles
- Simple heuristics can expose benchmark shortcuts.
- Multimodal embeddings can be created without backbone retraining.
- RAG failures stem from premature question-to-answer routing.
Method
PI-SERINI uses a minimal search agent with BM25, caching, and selective text pulling. Jina-embeddings-v5-omni bolts pretrained encoders onto a frozen text model, training only thin projector layers.
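One possible reading of "caching and selective text pulling" is sketched below. The summary does not say what PI-SERINI caches or how it pulls text, so everything here is an assumption for illustration: the `SearchTool` class, its `search` and `read` methods, and the term-count ranking that stands in for BM25 are all hypothetical and are not taken from the justram/pi-serini code.

```python
class SearchTool:
    """Hypothetical search tool: cached ranking, short snippets, on-demand reads."""

    def __init__(self, corpus, snippet_chars=80):
        self.corpus = corpus              # maps doc_id -> full text
        self.snippet_chars = snippet_chars
        self._cache = {}                  # (normalized query, k) -> ranked doc ids
        self.backend_calls = 0            # how many times ranking actually ran

    def _rank(self, query, k):
        # Stand-in for BM25 (assumption): score = total query-term occurrences.
        terms = query.lower().split()
        scored = []
        for doc_id, text in self.corpus.items():
            tokens = text.lower().split()
            score = sum(tokens.count(t) for t in terms)
            if score > 0:
                scored.append((score, doc_id))
        # Highest score first; ties broken by doc_id for determinism.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [doc_id for _, doc_id in scored[:k]]

    def search(self, query, k=3):
        """Return (doc_id, snippet) pairs; repeated queries hit the cache."""
        key = (query.strip().lower(), k)
        if key not in self._cache:
            self.backend_calls += 1
            self._cache[key] = self._rank(query, k)
        return [(d, self.corpus[d][: self.snippet_chars]) for d in self._cache[key]]

    def read(self, doc_id, start=0, length=500):
        """Pull only a slice of one document instead of its full text."""
        return self.corpus[doc_id][start : start + length]


corpus = {
    "d1": "bm25 ranks documents by term overlap",
    "d2": "dense retrievers embed text",
    "d3": "bm25 bm25 lexical",
}
tool = SearchTool(corpus)
hits = tool.search("bm25")        # first call runs the ranking
hits_again = tool.search("BM25 ") # normalized to the same cache key
passage = tool.read("d1", 0, 10)  # fetch a slice only when it is needed
```

The design choice being illustrated is that the agent sees short snippets by default and pays for longer text only for documents it chooses to open, which is one plausible way such a setup could keep token costs down.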
In practice
- Tune BM25 for long, noisy documents (e.g., k1=25, b=1).
- Replace Transformer decoders with MLP heads to speed up generative recommendation (SID-MLP reports an 8.74x throughput gain).
- Employ program synthesis for multi-hop RAG to improve robustness.
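To make the BM25 tuning advice above concrete, here is a minimal sketch of what k1 and b control. It uses one common Okapi BM25 variant (Lucene-style IDF) in plain Python; the function name `bm25_scores` and the toy documents are assumptions for illustration, and this is not the scoring code used by PI-SERINI.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query.

    k1 controls how quickly repeated terms stop adding score
    (larger k1 = slower saturation); b controls how strongly
    scores are normalized by document length (b=1 = full).
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        norm = 1 - b + b * len(d) / avg_len
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * norm)
        scores.append(score)
    return scores


# Three equal-length toy documents, so only term frequency differs.
docs = [["a", "x", "x", "x", "x"], ["a"] * 5, ["y"] * 5]
low = bm25_scores(["a"], docs, k1=1.2, b=1.0)
high = bm25_scores(["a"], docs, k1=25.0, b=1.0)
```

With the larger k1, the document that repeats the query term five times pulls further ahead of the one that mentions it once, because term frequency saturates more slowly. That is the kind of behavior a setting like k1=25 trades for on long, noisy documents.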
Topics
- Agentic Search
- Generative Recommendation
- Multimodal Embeddings
- Retrieval-Augmented Generation
- Efficient Embeddings
Code references
- justram/pi-serini
- haoyuhan1/GraphRec
- ztguo715/SID-MLP
- codefuse-ai/CodeFuse-Embeddings
- hanxiao/embedding-ttc
Best for: AI Architect, AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Top Information Retrieval Papers of the Week.