Even GenAI uses Wikipedia as a source
Summary
Wikimedia Deutschland's AI project lead, Philippe Saade, discusses the Wikidata Embedding Project, which vectorized 30 million of Wikidata's 119 million entries to enable semantic search and support open-source AI development. The initiative aims to ease the infrastructure burden caused by extensive scraping of Wikimedia sites by offering a more efficient data access point. The project converts each Wikidata item into a textual representation that aggregates information from its connected graph links, then embeds that text with Jina AI's jina-embeddings-v3 model, using 512-dimensional Matryoshka embeddings to balance accuracy against resource use. An integrated MCP server helps LLMs generate precise SPARQL queries by exploring the knowledge graph through the vector database. The alpha version, launched in October, is in user testing to gather feedback on use cases and identify areas for improvement, with periodic vector updates planned.
Key takeaway
For AI Architects and NLP Engineers building applications that rely on large knowledge bases, consider adopting a vector database approach like the Wikidata Embedding Project. This strategy can significantly reduce the load on source APIs and enable more efficient semantic search and RAG applications. Explore how combining vector search for initial exploration with precise knowledge graph queries (e.g., SPARQL) can enhance your system's capabilities and data integrity, while also contributing to open-source AI development.
Key insights
Vectorizing Wikidata enables semantic search and offloads infrastructure strain from widespread data scraping.
Principles
- Cooperation is better than resistance for data access.
- Balance exploration (vector search) with precision (knowledge graph queries).
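A minimal sketch of the exploration-then-precision pattern, under stated assumptions: the three-dimensional vectors are made-up stand-ins for real embeddings, and `explore` and `instances_query` are hypothetical helper names, not part of the project's API. The identifiers Q42 (Douglas Adams), Q5 (human), Q146 (house cat), and P31 (instance of) are real Wikidata IDs.

```python
import math

# Toy 3-dimensional vectors standing in for real embeddings (values are made up).
TOY_INDEX = {
    "Q42": [0.9, 0.1, 0.0],   # Douglas Adams
    "Q5": [0.1, 0.9, 0.1],    # human
    "Q146": [0.0, 0.2, 0.9],  # house cat
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def explore(query_vector, index, k=1):
    """Exploration step: return the k item IDs closest to the query vector."""
    ranked = sorted(index, key=lambda qid: cosine(query_vector, index[qid]), reverse=True)
    return ranked[:k]

def instances_query(class_qid, limit=10):
    """Precision step: build a SPARQL query listing instances of a class."""
    return (
        "SELECT ?item ?itemLabel WHERE {\n"
        f"  ?item wdt:P31 wd:{class_qid} .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}\n"
        f"LIMIT {limit}"
    )

# A hypothetical query vector that happens to sit closest to the "human" entry.
best = explore([0.2, 0.8, 0.1], TOY_INDEX, k=1)
query = instances_query(best[0])
print(best)
print(query)
```

The vector lookup only has to land on a plausible identifier; the SPARQL query then returns exact, structured results from the graph rather than approximate neighbors.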
Method
Transform Wikidata items into textual representations by aggregating labels, descriptions, aliases, and statement properties, then embed these with a pre-trained model such as jina-embeddings-v3, chunking long texts for efficiency.
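The text-building step might look roughly like the sketch below. It assumes a simplified item dictionary rather than the real Wikidata JSON format, and it chunks by word count as a stand-in for the token counts a real tokenizer would provide. `item_to_text` and `chunk_words` are hypothetical names, and the embedding call itself is left out.

```python
def item_to_text(item):
    """Flatten a simplified item dict into one text string for embedding."""
    parts = [f"{item['label']}: {item['description']}"]
    if item.get("aliases"):
        parts.append("Also known as: " + ", ".join(item["aliases"]))
    for prop, values in item.get("statements", {}).items():
        parts.append(f"{prop}: {', '.join(values)}")
    return ". ".join(parts)

def chunk_words(text, max_words):
    """Split text into chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

# Hypothetical, hand-written item; a real item would come from a Wikidata dump.
item = {
    "label": "Douglas Adams",
    "description": "English writer and humorist",
    "aliases": ["Douglas Noel Adams"],
    "statements": {"instance of": ["human"], "occupation": ["novelist", "screenwriter"]},
}

text = item_to_text(item)
chunks = chunk_words(text, max_words=8)
print(text)
print(chunks)
```

Each chunk would then be passed to the embedding model and stored alongside the item's ID.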
In practice
- Use Parquet for efficient processing of large datasets.
- Consider Matryoshka embeddings for flexible embedding sizes.
- Integrate vector databases with knowledge graphs for enhanced RAG applications.
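To illustrate the flexible sizes that Matryoshka embeddings allow: models trained this way concentrate the most useful information in the leading dimensions, so a vector can be shortened by keeping its first components and rescaling to unit length. The sketch below uses a made-up 8-dimensional vector, and `truncate_embedding` is a hypothetical helper rather than a library function.

```python
import math

def truncate_embedding(vector, dims):
    """Keep the first `dims` components and rescale the result to unit length."""
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]

# Made-up 8-dimensional vector; real model outputs are far larger.
full = [0.5, 0.5, 0.5, 0.5, 0.1, 0.1, 0.1, 0.1]
small = truncate_embedding(full, 4)
print(len(small), small)
```

Shorter vectors reduce storage and speed up similarity search, usually at some cost in retrieval accuracy, so the right size is worth measuring on your own data.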
Topics
- Wikidata Embedding Project
- Vector Databases
- Semantic Search
- Knowledge Graphs
- Generative AI
Best for: AI Architect, NLP Engineer, CTO, AI Engineer, Machine Learning Engineer, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Stack Overflow Blog.