Vortex: Multi-Modal Fusion System for Intelligent Video Retrieval
Summary
Vortex is a multimodal video retrieval system developed by the FocusOnFun team for the Ho Chi Minh City AI Challenge 2025, aiming to advance intelligent multimedia search and temporal reasoning. The system performs adaptive keyframe extraction and generates multimodal metadata using vision-language and speech models. Its core is a hybrid retrieval strategy that fuses CLIP and SigLIP2 embeddings via Reciprocal Rank Fusion to balance global and fine-grained semantics. For interactivity, Vortex adds Rocchio-based relevance feedback and a multi-stage temporal search mechanism for sequential event alignment. Built on Milvus and Elasticsearch, the architecture supports scalable indexing and efficient retrieval. In the official competition, Vortex scored 79.6/88 (90.5%) in the Preliminary Round and was rated "Excellent" overall in the Final Round, with "Outstanding" results in the question-answering task, results that support the effectiveness of its hybrid approach.
Key takeaway
For AI engineers building intelligent video retrieval systems, Vortex shows that fusing global and fine-grained embeddings (CLIP and SigLIP2, combined via Reciprocal Rank Fusion) can deliver strong retrieval performance: the system scored 79.6/88 (90.5%) in the Preliminary Round of the Ho Chi Minh City AI Challenge 2025. Consider adding multimodal metadata generation and Rocchio-based relevance feedback to improve search precision and user interactivity. Backed by "Outstanding" question-answering results in the Final Round, this approach offers a solid foundation for scalable, context-aware multimedia search.
Key insights
Vortex fuses CLIP and SigLIP2 embeddings with relevance feedback for effective multimodal video retrieval.
Principles
- Hybrid embedding fusion (CLIP/SigLIP2) balances semantic granularity.
- Reciprocal Rank Fusion combines ranked lists from different retrievers using rank positions, so their raw scores need not be comparable.
- Relevance feedback enhances interactive search precision.
Method
Vortex extracts keyframes, generates multimodal metadata, then fuses CLIP and SigLIP2 embeddings via Reciprocal Rank Fusion. It refines results with Rocchio-based relevance feedback and temporal search.
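The fusion step can be illustrated with a minimal sketch of Reciprocal Rank Fusion, in which each item receives 1 / (k + rank) from every ranked list it appears in. The summary does not give Vortex's actual implementation, so the function name, the keyframe ids, and the constant k = 60 (a value commonly used for RRF) are assumptions made for illustration only.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list of (id, fused_score).

    k=60 is a commonly used default, assumed here; it is not taken
    from the Vortex paper.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks start at 1, so the top item contributes 1 / (k + 1).
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical keyframe ids returned by two separate embedding searches.
clip_ranking = ["kf_12", "kf_07", "kf_33"]
siglip2_ranking = ["kf_07", "kf_41", "kf_12"]

fused = reciprocal_rank_fusion([clip_ranking, siglip2_ranking])
fused_ids = [item for item, _ in fused]
print(fused_ids)
```

Because only rank positions enter the formula, the two models' similarity scores never have to be normalized against each other; an item ranked well by both lists rises above one ranked first by a single list.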
In practice
- Combine global (CLIP) and fine-grained (SigLIP2) embeddings for video search.
- Implement Rocchio feedback for user-guided query refinement.
- Utilize Milvus and Elasticsearch for scalable multimedia indexing.
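Rocchio feedback, as recommended above, moves the query embedding toward items the user marks relevant and away from items marked non-relevant. The following is a rough sketch of the classic update, not Vortex's code: the function name, the two-dimensional toy vectors, and the weights alpha = 1.0, beta = 0.75, gamma = 0.15 (common textbook defaults) are all assumptions.

```python
import math


def _mean(vectors, dim):
    """Component-wise mean; a zero vector when no feedback was given."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def rocchio_update(query, relevant, non_relevant,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Return a refined, L2-normalized query vector.

    The weights are textbook defaults assumed for illustration,
    not values reported for Vortex.
    """
    dim = len(query)
    pos = _mean(relevant, dim)
    neg = _mean(non_relevant, dim)
    updated = [alpha * q + beta * p - gamma * n
               for q, p, n in zip(query, pos, neg)]
    norm = math.sqrt(sum(x * x for x in updated))
    # Re-normalize so the result suits cosine-similarity search.
    return [x / norm for x in updated] if norm > 0 else updated


# Toy 2-D embeddings standing in for real query and keyframe vectors.
query = [1.0, 0.0]
liked = [[0.0, 1.0]]
disliked = [[1.0, 0.0]]

refined = rocchio_update(query, liked, disliked)
print(refined)
```

The refined vector can then be sent back to the vector index as a new query, so each round of feedback shifts the search toward what the user actually selected.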
Topics
- Video Retrieval
- Multimodal AI
- Embedding Fusion
- CLIP
- SigLIP2
- Relevance Feedback
Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.