SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, extended

Summary

SkMTEB is the first comprehensive MTEB-style text embedding benchmark for Slovak, a low-resource West Slavic language. It comprises 31 datasets across 7 task types, nearly four times the coverage Slovak receives in existing multilingual benchmarks. An evaluation of 31 embedding models found that large instruction-tuned multilingual models perform best, while existing Slovak-specific NLU models transfer poorly to embedding tasks. To meet the need for efficient, locally deployable models, the authors built e5-sk-small (45M parameters) and e5-sk-large (365M parameters) by applying vocabulary trimming and fine-tuning to Multilingual E5 models. These open-source models perform competitively with proprietary APIs despite size reductions of up to 62%, which makes them suitable for semantic search and retrieval-augmented generation (RAG). The benchmark, models, datasets, and code are openly released.

Key takeaway

For machine learning engineers building NLP systems for low-resource languages such as Slovak, this work offers a concrete path to efficient, high-performing text embeddings. Consider adopting SkMTEB for evaluation, and explore vocabulary trimming combined with targeted fine-tuning of models such as Multilingual E5. This approach yields compact, locally deployable models (e.g., 45M parameters) that are competitive with proprietary APIs on tasks such as semantic search and RAG, which can lower deployment cost and latency.

Key insights

SkMTEB provides a robust benchmark and efficient, language-specific embedding models for Slovak, a low-resource language.

Principles

NLU-tuned models underperform on embedding tasks.
Vocabulary trimming reduces model size with minimal performance loss.
Large models show diminishing returns for single-language embeddings.
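
The size effect of vocabulary trimming can be checked with simple arithmetic, since trimming only shrinks the token embedding matrix (vocabulary size times hidden size). The sketch below is a rough estimate, not the authors' calculation. The base vocabulary size (250,002), the hidden sizes (384 and 1024), and the base parameter totals (about 118M and 560M) are assumptions taken from the public Multilingual E5 model cards, not from this summary; only the 60K target vocabulary comes from the text.

```python
def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Number of parameters in a token embedding matrix."""
    return vocab_size * hidden_size


def params_after_trimming(total_params: int, old_vocab: int,
                          new_vocab: int, hidden_size: int) -> int:
    """Estimate total parameters once the embedding matrix is trimmed."""
    saved = (embedding_params(old_vocab, hidden_size)
             - embedding_params(new_vocab, hidden_size))
    return total_params - saved


# Assumed values (public model cards); not stated in the summary.
ASSUMED_BASE_VOCAB = 250_002
ASSUMED_MODELS = {
    "small": {"total": 118_000_000, "hidden": 384},
    "large": {"total": 560_000_000, "hidden": 1024},
}
TARGET_VOCAB = 60_000  # from the Method section

for name, cfg in ASSUMED_MODELS.items():
    trimmed = params_after_trimming(
        cfg["total"], ASSUMED_BASE_VOCAB, TARGET_VOCAB, cfg["hidden"]
    )
    reduction = 1 - trimmed / cfg["total"]
    print(f"{name}: ~{trimmed / 1e6:.0f}M parameters, {reduction:.0%} smaller")
```

Under these assumptions the estimates land close to the 45M and 365M figures and the 62% maximum reduction quoted in the summary. The smaller model shrinks proportionally more because its embedding matrix makes up a larger share of its parameters.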

Method

Adapt Multilingual E5 models by trimming vocabulary to 60K tokens based on target language frequency, then fine-tune on high-quality, curated language-specific datasets.
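
The trimming step described above can be sketched as follows. This is a minimal illustration on toy data, not the authors' implementation: the function names, the special-token handling, and the list-of-lists embedding table are all hypothetical. It counts token frequencies in a tokenized target-language corpus, keeps the special tokens plus the most frequent tokens up to the target size, and slices the embedding table accordingly.

```python
from collections import Counter


def select_vocab(tokenized_corpus, keep_size, special_ids=()):
    """Pick old token ids to keep: special tokens first, then most frequent."""
    counts = Counter(tid for seq in tokenized_corpus for tid in seq)
    kept = list(dict.fromkeys(special_ids))  # dedupe, preserve order
    kept_set = set(kept)
    for tid, _ in counts.most_common():
        if len(kept) >= keep_size:
            break
        if tid not in kept_set:
            kept.append(tid)
            kept_set.add(tid)
    return sorted(kept)


def trim_embeddings(embedding_rows, kept_ids):
    """Slice the embedding table and build an old-id to new-id mapping."""
    old_to_new = {old: new for new, old in enumerate(kept_ids)}
    trimmed = [embedding_rows[old] for old in kept_ids]
    return trimmed, old_to_new


# Toy example: a 10-token vocabulary with 2-dimensional embeddings.
corpus = [[0, 5, 7, 5, 2], [0, 7, 5, 9, 2]]           # token ids per sentence
embeddings = [[float(i), float(i) * 10] for i in range(10)]

kept = select_vocab(corpus, keep_size=4, special_ids=(0, 2))
trimmed, old_to_new = trim_embeddings(embeddings, kept)
print(kept)          # old ids that survive trimming
print(old_to_new)    # remapping to apply to the tokenizer
```

In a real pipeline the tokenizer must be rebuilt with the same mapping so that token ids and embedding rows stay aligned, and tokens dropped from the vocabulary fall back to smaller subword pieces or the unknown token. Fine-tuning on curated target-language data then follows.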

In practice

Use vocabulary trimming for compact language models.
Fine-tune on curated data for specific language tasks.
Prepend "query: " to queries and "passage: " to documents when encoding with E5 models.
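
The E5 prefix convention from the list above can be applied with a small helper before texts reach the encoder. This is a sketch with hypothetical function names. It only formats strings, and the actual encoding call depends on the library used to load the model.

```python
def as_query(text: str) -> str:
    """Format a search query the way E5 models expect."""
    return "query: " + text.strip()


def as_passage(text: str) -> str:
    """Format a document or passage the way E5 models expect."""
    return "passage: " + text.strip()


def prepare_retrieval_inputs(queries, passages):
    """Prefix both sides of a retrieval task before encoding."""
    return [as_query(q) for q in queries], [as_passage(p) for p in passages]


queries, passages = prepare_retrieval_inputs(
    ["hlavné mesto Slovenska"],
    ["Bratislava je hlavné mesto Slovenska.", "  Košice ležia na východe.  "],
)
print(queries)
print(passages)
```

The prefixed strings are then passed to the embedding model in place of the raw text. Omitting the prefixes usually degrades retrieval quality because the models were trained with them.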

Topics

Slovak NLP
Text Embeddings
Low-Resource Languages
MTEB Benchmark
Vocabulary Trimming
Multilingual E5

Code references

slovak-nlp/skmteb

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.