SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation

2026-06-11 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Software Development & Engineering · Depth: Advanced, quick

Summary

SkMTEB is the first comprehensive MTEB-style text embedding benchmark for Slovak, a low-resource West Slavic language. It comprises 31 datasets across 7 task types, nearly four times the depth of existing multilingual benchmarks for Slovak. An evaluation of 31 embedding models shows that large instruction-tuned multilingual models perform best, while existing Slovak-specific models trained for NLU tasks transfer poorly to embedding tasks. To meet the need for efficient, locally deployable Slovak embeddings, the authors built "e5-sk-small" (45M parameters) and "e5-sk-large" (365M parameters) by applying vocabulary trimming and fine-tuning to Multilingual E5 models. These open-source models perform competitively with proprietary APIs despite size reductions of up to 62%, and they can be deployed locally for semantic search and retrieval-augmented generation (RAG).

Key takeaway

For NLP engineers and researchers working with low-resource languages like Slovak, this work provides a clear path to high-performance, locally deployable text embeddings. Consider applying vocabulary trimming and fine-tuning to a multilingual model to create efficient, specialized embeddings for your target language. This strategy enables competitive performance in applications like semantic search and RAG without relying on proprietary APIs.

Key insights

Specialized text embedding models can be effectively adapted for low-resource languages using existing multilingual foundations.

Principles

Large instruction-tuned multilingual models excel in embedding tasks.
NLU-trained models transfer poorly to embedding tasks.
Vocabulary trimming significantly reduces model size.

Method

Develop efficient, locally deployable embeddings by applying vocabulary trimming and fine-tuning to multilingual models such as Multilingual E5.
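
The vocabulary trimming step can be illustrated with a minimal sketch. The summary does not describe how the authors selected tokens, so this example assumes the simplest rule: keep the special tokens plus every token observed in a target-language corpus, then copy the matching embedding rows. The function name `trim_vocabulary`, the toy vocabulary, and the two-dimensional embeddings are all hypothetical and stand in for a real tokenizer and model.

```python
from typing import Dict, List, Set, Tuple


def trim_vocabulary(
    vocab: Dict[str, int],
    embeddings: List[List[float]],
    keep_tokens: Set[str],
    special_tokens: Tuple[str, ...] = ("<s>", "</s>", "<pad>", "<unk>"),
) -> Tuple[Dict[str, int], List[List[float]]]:
    """Keep special tokens and tokens seen in the target-language corpus.

    Returns a re-indexed vocabulary and the matching embedding rows.
    Assumption: selection by corpus occurrence; the original work may differ.
    """
    # Walk tokens in their original id order so the result is deterministic.
    kept = [
        token
        for token, _ in sorted(vocab.items(), key=lambda item: item[1])
        if token in keep_tokens or token in special_tokens
    ]
    # Assign new contiguous ids and copy the corresponding embedding rows.
    new_vocab = {token: new_id for new_id, token in enumerate(kept)}
    new_embeddings = [embeddings[vocab[token]] for token in kept]
    return new_vocab, new_embeddings


# Toy multilingual vocabulary (hypothetical): 4 special tokens + 6 word pieces.
vocab = {
    "<s>": 0, "</s>": 1, "<pad>": 2, "<unk>": 3,
    "▁dobrý": 4, "▁deň": 5, "▁hello": 6, "▁世界": 7, "▁svet": 8, "▁bonjour": 9,
}
# One toy 2-dimensional embedding row per token id.
embeddings = [[float(i), float(i) + 0.5] for i in range(len(vocab))]

# Tokens assumed to have been observed when tokenizing a Slovak corpus.
slovak_tokens = {"▁dobrý", "▁deň", "▁svet"}

new_vocab, new_embeddings = trim_vocabulary(vocab, embeddings, slovak_tokens)
reduction = 1 - len(new_vocab) / len(vocab)
print(f"kept {len(new_vocab)} of {len(vocab)} tokens ({reduction:.0%} of rows removed)")
```

In a real model the embedding matrix has one row per token, so dropping unused tokens removes parameters without touching the transformer layers. A full pipeline would also need to rebuild the tokenizer around the new ids and then fine-tune, which this sketch leaves out.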

In practice

Deploy "e5-sk-small" or "e5-sk-large" for Slovak semantic search.
Integrate these models into RAG systems for Slovak content.
Adapt this approach for other under-resourced languages.
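
The semantic search use case above comes down to embedding a query and a set of passages, then ranking passages by cosine similarity. The sketch below shows that ranking loop under stated assumptions: `make_toy_embedder` is a hypothetical word-count stand-in so the example runs without downloading anything, and it is not the e5-sk model. The "query: " and "passage: " prefixes follow the convention of the Multilingual E5 family; whether the e5-sk models expect them is an assumption to check against their model cards.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(
    query: str,
    passages: List[str],
    embed: Callable[[str], Vector],
    top_k: int = 3,
) -> List[Tuple[float, str]]:
    """Rank passages by cosine similarity to the query, best first."""
    # Assumption: E5-style prefixes; verify against the deployed model.
    query_vec = embed("query: " + query)
    scored = [(cosine(query_vec, embed("passage: " + p)), p) for p in passages]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


def make_toy_embedder(vocabulary: List[str]) -> Callable[[str], Vector]:
    """Hypothetical stand-in embedder: word counts over a fixed vocabulary."""
    index = {word: i for i, word in enumerate(vocabulary)}

    def embed(text: str) -> Vector:
        vec = [0.0] * len(index)
        for word in text.lower().split():
            word = word.strip(".,:;!?")
            if word in index:  # Unknown words (and prefixes) are ignored.
                vec[index[word]] += 1.0
        return vec

    return embed


passages = [
    "Bratislava je hlavné mesto Slovenska.",
    "Vysoké Tatry sú najvyššie pohorie Slovenska.",
    "Bryndzové halušky sú tradičné jedlo.",
]
embed = make_toy_embedder(
    ["bratislava", "hlavné", "mesto", "slovenska", "tatry", "pohorie", "halušky", "jedlo"]
)

results = search("Aké je hlavné mesto Slovenska?", passages, embed, top_k=2)
for score, passage in results:
    print(f"{score:.3f}  {passage}")
```

To use a real model, replace `embed` with a function that calls the model's encoder and returns one vector per input string. The same top-k results can then feed the retrieval stage of a RAG pipeline.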

Topics

SkMTEB
Text Embeddings
Slovak Language
Low-Resource NLP
Multilingual E5
Retrieval-Augmented Generation
Semantic Search

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.