Even GenAI uses Wikipedia as a source

2026-02-20 · Source: Stack Overflow Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, extended

Summary

Philippe Saade, AI project lead at Wikimedia Deutschland, discusses the Wikidata Embedding Project, which vectorized 30 million of Wikidata's 119 million entries to enable semantic search and support open-source AI development. The project also aims to relieve the infrastructure burden caused by heavy scraping of Wikimedia sites by offering a more efficient data access point. Each Wikidata item is converted into a textual representation that aggregates information from its connected graph links, then embedded with Jina AI's jina-embeddings-v3 model, using 512-dimensional Matryoshka embeddings to balance accuracy against resource cost. An integrated MCP server helps LLMs write precise SPARQL queries by first exploring the knowledge graph through the vector database. The alpha version, launched in October, is in user testing to gather feedback on use cases and identify areas for improvement; periodic vector updates are planned.

Key takeaway

AI architects and NLP engineers building applications on top of large knowledge bases should consider a vector database approach like that of the Wikidata Embedding Project. It can significantly reduce the load on source APIs and make semantic search and RAG applications more efficient. Pairing vector search for initial exploration with precise knowledge graph queries (e.g., SPARQL) can strengthen both a system's capabilities and its data integrity, while also contributing to open-source AI development.

Key insights

Vectorizing Wikidata enables semantic search and offloads infrastructure strain from widespread data scraping.

Principles

Cooperation is better than resistance for data access.
Balance exploration (vector search) with precision (knowledge graph queries).
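
One way to picture the second principle is a two-step lookup: vector search proposes candidate items, and a SPARQL query then retrieves exact facts about them. The sketch below covers only the second step and is illustrative, not the project's code. The function name `build_sparql` and the candidate list are hypothetical; the `wd:` and `wdt:` prefixes are the ones the Wikidata Query Service predefines, and Q42, Q5, and P106 are real Wikidata identifiers (Douglas Adams, human, occupation).

```python
import re
from typing import Iterable

# Wikidata item IDs look like Q42; property IDs look like P106.
QID_PATTERN = re.compile(r"Q[1-9][0-9]*")
PID_PATTERN = re.compile(r"P[1-9][0-9]*")


def build_sparql(candidate_qids: Iterable[str], property_id: str) -> str:
    """Build a SPARQL query that reads one property for a set of candidate items.

    Hypothetical helper: it only assembles the query string and does not
    contact any endpoint.
    """
    # Drop anything that is not a well-formed QID, so arbitrary text
    # never ends up inside the query.
    qids = [q for q in candidate_qids if QID_PATTERN.fullmatch(q)]
    if not qids:
        raise ValueError("no valid QIDs among the candidates")
    if not PID_PATTERN.fullmatch(property_id):
        raise ValueError(f"invalid property ID: {property_id!r}")

    values = " ".join(f"wd:{q}" for q in qids)
    return (
        "SELECT ?item ?value WHERE {\n"
        f"  VALUES ?item {{ {values} }}\n"
        f"  ?item wdt:{property_id} ?value .\n"
        "}"
    )


# Assumed output of an earlier vector-search step (made up for illustration).
candidates = ["Q42", "Q5", "not-an-id"]
query = build_sparql(candidates, "P106")
print(query)
```

Restricting the query to a `VALUES` list keeps the precise step cheap: the knowledge graph is asked only about the handful of items the exploratory step surfaced.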

Method

Transform each Wikidata item into a textual representation by aggregating its labels, descriptions, aliases, and statement properties, then embed that text with a pre-trained model such as jina-embeddings-v3, chunking the data for efficiency.
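
The text-building step can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the item dictionary layout, the output format, and the helper names `item_to_text` and `chunk_words` are all hypothetical, and chunks are split on whitespace words rather than model tokens to keep the example dependency-free. The embedding call itself is omitted.

```python
from typing import Dict, List


def item_to_text(item: Dict) -> str:
    """Flatten one item into a single text string (hypothetical format)."""
    parts = [f"{item['label']}: {item.get('description', '')}".strip()]
    aliases = item.get("aliases", [])
    if aliases:
        parts.append("Also known as: " + ", ".join(aliases))
    # Each statement becomes one "property: value, value" line.
    for prop, values in item.get("statements", {}).items():
        parts.append(f"{prop}: {', '.join(values)}")
    return "\n".join(parts)


def chunk_words(text: str, max_words: int) -> List[str]:
    """Split text into chunks of at most max_words whitespace-separated words.

    Assumption: a real pipeline would count model tokens, not words.
    """
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


# Illustrative item; this dict layout is NOT the Wikidata JSON schema.
item = {
    "label": "Douglas Adams",
    "description": "English writer and humorist",
    "aliases": ["Douglas Noel Adams"],
    "statements": {
        "occupation": ["novelist", "screenwriter"],
        "instance of": ["human"],
    },
}

text = item_to_text(item)
chunks = chunk_words(text, max_words=8)
print(text)
print(len(chunks), "chunks")
```

Each chunk would then be passed to the embedding model, and the resulting vectors stored alongside the item's identifier.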

In practice

Use Parquet for efficient processing of large datasets.
Consider Matryoshka embeddings for flexible embedding sizes.
Integrate vector databases with knowledge graphs for enhanced RAG applications.
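
The "flexible embedding sizes" point can be illustrated with a small sketch. Models trained with Matryoshka representation learning are designed so that the leading dimensions of a vector stay useful on their own, which means a stored vector can be shortened by keeping its first k components and re-normalizing. The helper names and the toy vectors below are made up for illustration; the technique is only meaningful for embeddings from a model trained this way.

```python
import math
from typing import List, Sequence


def truncate_and_normalize(vec: Sequence[float], dims: int) -> List[float]:
    """Keep the first `dims` components and rescale to unit length."""
    head = list(vec[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale
    return [x / norm for x in head]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))


# Toy 4-dimensional "embeddings" (made-up numbers, not model output).
doc_full = [3.0, 4.0, 12.0, 0.5]
query_full = [6.0, 8.0, -1.0, 2.0]

# Shrink both to 2 dimensions, trading some accuracy for less storage.
doc_short = truncate_and_normalize(doc_full, 2)
query_short = truncate_and_normalize(query_full, 2)

similarity = dot(doc_short, query_short)
print(doc_short, similarity)
```

The same stored vectors can thus serve several size budgets: a smaller prefix for a fast first pass, a longer one when accuracy matters more.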

Topics

Wikidata Embedding Project
Vector Databases
Semantic Search
Knowledge Graphs
Generative AI

Code references

philippesaade-wmde/WikidataTextEmbedding

Best for: AI Architect, NLP Engineer, CTO, AI Engineer, Machine Learning Engineer, Software Engineer


Editorial summary, takeaway, and curation by AIssential. Original article published by Stack Overflow Blog.