SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation
Summary
SkMTEB introduces the first comprehensive MTEB-style text embedding benchmark for Slovak, a low-resource West Slavic language. It comprises 31 datasets across 7 task types, nearly four times the depth of existing multilingual benchmarks for Slovak. An evaluation of 31 embedding models shows that large instruction-tuned multilingual models achieve the strongest performance, while existing Slovak-specific models trained for NLU tasks transfer poorly to embedding tasks. To address the need for efficient, locally deployable Slovak embeddings, the authors developed "e5-sk-small" (45M parameters) and "e5-sk-large" (365M parameters) by applying vocabulary trimming and fine-tuning to Multilingual E5 models. These open-source models perform competitively with proprietary APIs despite size reductions of up to 62%, and they can be deployed locally for semantic search and retrieval-augmented generation (RAG).
Key takeaway
For NLP engineers and researchers working with low-resource languages like Slovak, this work provides a clear path to high-performance, locally deployable text embeddings. Consider applying vocabulary trimming and fine-tuning to an existing multilingual model to create efficient, specialized embeddings for your target language. This strategy delivers competitive performance for applications like semantic search and RAG without relying on proprietary APIs.
Key insights
Specialized text embedding models can be effectively adapted for low-resource languages using existing multilingual foundations.
Principles
- Large instruction-tuned multilingual models achieve the strongest performance on Slovak embedding tasks.
- Slovak-specific models trained for NLU tasks transfer poorly to embedding tasks.
- Vocabulary trimming reduces model size by up to 62% while keeping performance competitive.
Method
Develop efficient, locally deployable embeddings by applying vocabulary trimming and fine-tuning to existing multilingual models such as Multilingual E5.
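To make the vocabulary-trimming step concrete, here is a minimal Python sketch on a toy vocabulary. It is not the authors' implementation: the function names (`trim_vocabulary`, `embedding_param_count`), the toy tokens, and the rule of keeping every token observed in a target-language corpus are assumptions made for illustration. A real pipeline would also have to rebuild the tokenizer so that its ids match the trimmed embedding matrix.

```python
def trim_vocabulary(vocab, embeddings, corpus_tokens, special_tokens):
    """Keep only the embedding rows for tokens seen in the target-language corpus.

    vocab: dict mapping token -> row index in `embeddings`
    embeddings: list of rows, one list of floats per token
    Returns (new_vocab, new_embeddings) with contiguous, re-numbered ids.
    """
    keep = set(special_tokens) | set(corpus_tokens)
    # Preserve the original id order so special tokens keep their low ids.
    kept_tokens = sorted((t for t in keep if t in vocab), key=vocab.get)
    new_vocab = {tok: new_id for new_id, tok in enumerate(kept_tokens)}
    new_embeddings = [list(embeddings[vocab[tok]]) for tok in kept_tokens]
    return new_vocab, new_embeddings


def embedding_param_count(embeddings):
    """Number of scalar parameters in the embedding table."""
    return sum(len(row) for row in embeddings)


# Toy multilingual vocabulary (hypothetical): 8 tokens, 4-dimensional rows.
vocab = {
    "<pad>": 0, "<unk>": 1, "dobrý": 2, "deň": 3,
    "hello": 4, "bonjour": 5, "mesto": 6, "hola": 7,
}
embeddings = [[float(i)] * 4 for i in range(len(vocab))]

# Tokens observed when tokenizing a (toy) Slovak corpus.
corpus_tokens = ["dobrý", "deň", "mesto", "dobrý"]

new_vocab, new_embeddings = trim_vocabulary(
    vocab, embeddings, corpus_tokens, special_tokens=["<pad>", "<unk>"]
)

before = embedding_param_count(embeddings)
after = embedding_param_count(new_embeddings)
reduction = 1 - after / before
print(f"embedding parameters: {before} -> {after} ({reduction:.1%} smaller)")
```

In this toy case only the embedding table shrinks; the transformer layers are untouched. In multilingual models with very large vocabularies the embedding table can account for a large share of all parameters, which is why trimming it can reduce total model size substantially.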
In practice
- Deploy "e5-sk-small" or "e5-sk-large" for Slovak semantic search.
- Integrate these models into RAG systems for Slovak content.
- Adapt this approach for other under-resourced languages.
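The semantic search use case above can be sketched as follows. This is an illustrative outline, not a documented API for the e5-sk models: `toy_embed` is a hypothetical stand-in (a hashed bag of words) for a real embedding model, and `search` is a name chosen here. The "query: " and "passage: " prefixes follow the Multilingual E5 convention; whether the adapted models expect the same prefixes is an assumption to check against their documentation.

```python
import math
import zlib


def toy_embed(text, dim=512):
    """Hypothetical stand-in for an embedding model: hashed bag of words.

    Returns an L2-normalized dense vector, so a dot product of two
    outputs equals their cosine similarity.
    """
    vec = [0.0] * dim
    for word in text.lower().split():
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def search(query, passages, embed=toy_embed, top_k=2):
    """Rank passages by cosine similarity to the query."""
    q = embed("query: " + query)  # E5-style prefix (assumed for e5-sk)
    scored = []
    for passage in passages:
        d = embed("passage: " + passage)
        score = sum(a * b for a, b in zip(q, d))
        scored.append((score, passage))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


passages = [
    "Bratislava je hlavné mesto Slovenska",
    "Tatry sú najvyššie pohorie",
    "Bryndzové halušky sú národné jedlo",
]
results = search("hlavné mesto Slovenska", passages)
for score, passage in results:
    print(f"{score:.3f}  {passage}")
```

To use a real model, replace `toy_embed` with a function that returns the model's normalized sentence embedding; the ranking logic stays the same. In a RAG system, the top-ranked passages would then be passed to the generator as context.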
Topics
- SkMTEB
- Text Embeddings
- Slovak Language
- Low-Resource NLP
- Multilingual E5
- Retrieval-Augmented Generation
- Semantic Search
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.