SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation
Summary
SkMTEB introduces the first comprehensive MTEB-style text embedding benchmark for Slovak, a low-resource West Slavic language. The benchmark comprises 31 datasets across 7 task types, nearly four times the depth of existing multilingual coverage for Slovak. An evaluation of 31 embedding models found that large instruction-tuned multilingual models perform best, while existing Slovak-specific NLU models transfer poorly to embedding tasks. To meet the need for efficient, locally deployable solutions, the authors developed e5-sk-small (45M parameters) and e5-sk-large (365M parameters) by applying vocabulary trimming and fine-tuning to Multilingual E5 models. These open-source models perform competitively with proprietary APIs despite size reductions of up to 62%, which makes them suitable for semantic search and retrieval-augmented generation (RAG). The benchmark, models, datasets, and code are openly released.
Key takeaway
For machine learning engineers building NLP solutions for low-resource languages like Slovak, this work offers a practical path to efficient, high-performing text embeddings. Consider adopting the SkMTEB benchmark for evaluation, and explore vocabulary trimming combined with targeted fine-tuning of models like Multilingual E5. This approach yields compact, locally deployable models (e.g., 45M parameters) that compete with proprietary APIs on tasks like semantic search and RAG, which can reduce deployment cost and latency.
Key insights
SkMTEB provides a robust benchmark and efficient, language-specific embedding models for Slovak, a low-resource language.
Principles
- NLU-tuned models underperform on embedding tasks.
- Vocabulary trimming reduces model size with minimal performance loss.
- Large models show diminishing returns for single-language embeddings.
Method
Adapt Multilingual E5 models by trimming vocabulary to 60K tokens based on target language frequency, then fine-tune on high-quality, curated language-specific datasets.
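The trimming step can be illustrated with a small, library-free sketch. It is not the authors' implementation: the function names, the toy token ids, and the use of plain lists in place of a real embedding matrix are all assumptions made for illustration. The idea shown is only the core one, which is to count token frequencies on a target-language corpus, keep the most frequent ids plus the special tokens, and slice the embedding table accordingly.

```python
from collections import Counter


def select_kept_ids(token_id_corpus, special_ids, target_size):
    """Pick which old token ids survive trimming (hypothetical helper).

    token_id_corpus: iterable of token-id sequences from target-language text.
    special_ids: ids that must always be kept (padding, unknown, etc.).
    target_size: desired vocabulary size, e.g. 60_000 in the summary above.
    """
    counts = Counter(tid for seq in token_id_corpus for tid in seq)
    kept = set(special_ids)
    for tid, _ in counts.most_common():
        if len(kept) >= target_size:
            break
        kept.add(tid)
    return sorted(kept)


def trim_embeddings(embedding_rows, kept_ids):
    """Slice the embedding table and build an old-id -> new-id map."""
    old_to_new = {old: new for new, old in enumerate(kept_ids)}
    trimmed = [embedding_rows[old] for old in kept_ids]
    return trimmed, old_to_new


def remap(sequence, old_to_new, unk_new_id):
    """Translate a sequence of old ids; dropped tokens fall back to unknown."""
    return [old_to_new.get(tid, unk_new_id) for tid in sequence]


if __name__ == "__main__":
    # Toy data: a 10-token vocabulary with 2-dimensional embeddings.
    corpus = [[5, 7, 7, 9], [7, 5, 8]]
    embeddings = [[float(i), float(i)] for i in range(10)]

    kept = select_kept_ids(corpus, special_ids=[0, 1], target_size=4)
    trimmed, mapping = trim_embeddings(embeddings, kept)
    print(kept, len(trimmed), remap([5, 7, 9], mapping, unk_new_id=1))
```

In a real model the tokenizer's own vocabulary file would also have to be rewritten to match the new ids, and the trimmed model would then be fine-tuned as the method describes; both steps are outside this sketch.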
In practice
- Use vocabulary trimming for compact language models.
- Fine-tune on curated data for specific language tasks.
- Prepend "query: " to search queries and "passage: " to documents before encoding them with E5 models.
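The E5 prefix convention from the last item can be sketched as plain string preparation. The helper names below are hypothetical, and the encoding step itself is left out because it depends on whichever embedding library is used; the sketch only shows how inputs would be formatted before they reach the model.

```python
QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "


def as_query(text):
    """Format a search query the way E5 models expect (hypothetical helper)."""
    return QUERY_PREFIX + text.strip()


def as_passage(text):
    """Format a document or passage the way E5 models expect."""
    return PASSAGE_PREFIX + text.strip()


def prepare_retrieval_inputs(queries, passages):
    """Return prefixed queries and passages, ready to pass to an encoder."""
    return [as_query(q) for q in queries], [as_passage(p) for p in passages]


if __name__ == "__main__":
    queries, passages = prepare_retrieval_inputs(
        ["Aké je hlavné mesto Slovenska?"],
        ["Bratislava je hlavné mesto Slovenska."],
    )
    print(queries)
    print(passages)
```

Keeping the prefixes in one place makes it harder to forget them at query time, which matters because the model was trained with these markers on its inputs.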
Topics
- Slovak NLP
- Text Embeddings
- Low-Resource Languages
- MTEB Benchmark
- Vocabulary Trimming
- Multilingual E5
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.