Vortex: Multi-Modal Fusion System for Intelligent Video Retrieval

2026-06-18 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

Vortex is a multimodal video retrieval system developed by the FocusOnFun team for the Ho Chi Minh City AI Challenge 2025, with the aim of advancing intelligent multimedia search and temporal reasoning. The system uses adaptive keyframe extraction and generates multimodal metadata with vision-language and speech models. Its core is a hybrid retrieval strategy that fuses CLIP and SigLIP2 results via Reciprocal Rank Fusion to balance global and fine-grained semantics. For interactivity, Vortex adds Rocchio-based relevance feedback and a multi-stage temporal search mechanism for aligning sequential events. The architecture is built on Milvus and Elasticsearch to support scalable indexing and efficient retrieval. In the official competition, Vortex scored 79.6/88 (90.5%) in the Preliminary Round. In the Final Round it was rated "Excellent" overall, with "Outstanding" results in the question-answering task, which supports the effectiveness of the hybrid approach.

Key takeaway

For AI engineers building intelligent video retrieval systems, Vortex shows that fusing global and fine-grained embeddings (CLIP and SigLIP2, combined via Reciprocal Rank Fusion) can deliver strong competition results: 79.6/88 in the Preliminary Round and "Outstanding" question-answering results in the Final Round. Consider adding multimodal metadata generation and Rocchio-based relevance feedback to improve search precision and user interactivity. Together with Milvus and Elasticsearch for indexing, this gives a practical foundation for scalable, context-aware multimedia search.

Key insights

Vortex fuses CLIP and SigLIP2 embeddings with relevance feedback for effective multimodal video retrieval.

Principles

Hybrid embedding fusion (CLIP/SigLIP2) balances global and fine-grained semantics.
Reciprocal Rank Fusion combines ranked lists from different retrievers using rank positions rather than raw scores.
Relevance feedback enhances interactive search precision.
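
As an illustration of the fusion principle, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists. The function name, the keyframe ids, and the constant k=60 (a commonly used default) are assumptions for the example and are not taken from the Vortex write-up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of item ids into one ranked list.

    Each item receives sum(1 / (k + rank)) over the lists that contain it,
    with rank starting at 1. Only rank positions are used, so the raw
    similarity scores of the two models never need to be comparable.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by item id for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical keyframe ids returned by two separate embedding searches.
clip_hits = ["kf_12", "kf_07", "kf_31"]
siglip2_hits = ["kf_07", "kf_44", "kf_12"]
fused = reciprocal_rank_fusion([clip_hits, siglip2_hits])
```

Items that appear near the top of both lists rise in the fused ranking, while items found by only one model are kept but ranked lower.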

Method

Vortex extracts keyframes, generates multimodal metadata, then fuses CLIP and SigLIP2 embeddings via Reciprocal Rank Fusion. It refines results with Rocchio-based relevance feedback and temporal search.
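
The summary does not describe how the temporal search stage works internally. The sketch below shows one plausible reading, under stated assumptions: each sub-query of a sequential event is searched separately, and hits are then chained within the same video in increasing time order. The function name, the hit tuple layout, and the max_gap parameter are all hypothetical.

```python
def temporal_search(stage_hits, max_gap=60.0):
    """Find the best-scoring chain of hits, one per stage, in event order.

    stage_hits: one list per sub-query, each containing
                (video_id, time_sec, score) tuples.
    A chain is valid when consecutive hits share a video and the later hit
    occurs after the earlier one, at most max_gap seconds apart.
    Returns (total_score, [(video_id, time_sec), ...]) or None.
    """
    if not stage_hits:
        return None
    # Chains that end at each hit of the first stage.
    chains = [(score, [(vid, t)]) for vid, t, score in stage_hits[0]]
    for hits in stage_hits[1:]:
        next_chains = []
        for vid, t, score in hits:
            best = None
            for total, path in chains:
                prev_vid, prev_t = path[-1]
                if prev_vid == vid and 0 < t - prev_t <= max_gap:
                    if best is None or total > best[0]:
                        best = (total, path)
            if best is not None:
                next_chains.append((best[0] + score, best[1] + [(vid, t)]))
        chains = next_chains
    return max(chains, key=lambda c: c[0]) if chains else None


# Hypothetical hits for a three-step event description.
stages = [
    [("v1", 10.0, 0.9), ("v2", 5.0, 0.8)],
    [("v1", 25.0, 0.7), ("v2", 3.0, 0.95)],
    [("v1", 40.0, 0.6)],
]
result = temporal_search(stages)
```

In this example the strong second-stage hit in v2 is discarded because it occurs before the first-stage hit in the same video, so the chain in v1 is returned.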

In practice

Combine global (CLIP) and fine-grained (SigLIP2) embeddings for video search.
Implement Rocchio feedback for user-guided query refinement.
Utilize Milvus and Elasticsearch for scalable multimedia indexing.
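
To make the Rocchio step concrete, here is a minimal sketch of a query update from user feedback. It uses the classic Rocchio formula; the weights alpha=1.0, beta=0.75, and gamma=0.15 are common textbook defaults, not values reported for Vortex, and the function names are illustrative.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def rocchio_update(query, relevant, non_relevant,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query embedding toward relevant results and away from
    non-relevant ones, then L2-normalize it for cosine-similarity search."""
    updated = [alpha * q for q in query]
    if relevant:
        centroid = mean_vector(relevant)
        updated = [u + beta * c for u, c in zip(updated, centroid)]
    if non_relevant:
        centroid = mean_vector(non_relevant)
        updated = [u - gamma * c for u, c in zip(updated, centroid)]
    norm = math.sqrt(sum(u * u for u in updated))
    return [u / norm for u in updated] if norm > 0 else updated


# Toy 2-D embeddings: the user marks one result relevant, one non-relevant.
new_query = rocchio_update([1.0, 0.0], relevant=[[0.0, 1.0]],
                           non_relevant=[[1.0, 0.0]])
```

The updated vector can be sent back to the vector index as the next query, so each feedback round shifts the search toward what the user marked as relevant.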

Topics

Video Retrieval
Multimodal AI
Embedding Fusion
CLIP
SigLIP2
Relevance Feedback

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.