SomaliWeb v1: A Quality-Filtered Somali Web Corpus with a Matched Tokenizer and a Public Language-Identification Benchmark
Summary
SomaliWeb v1 is a newly released, quality-filtered Somali web corpus comprising 819,322 documents and approximately 303 million tokens. This corpus was constructed from HPLT v2, CC100, and Somali Wikipedia using a six-stage reproducible pipeline. It addresses the lack of publicly documented dedicated Somali pretraining corpora, companion tokenizers, and language-identification benchmarks. Alongside the corpus, the release includes a matched BPE-16K tokenizer and the first public benchmark for three production language identifiers. Analysis of existing distributions revealed significant quality issues: HPLT v2's "cleaned" Somali data contained 17.3% byte-exact duplicates, 56.1% fixable mojibake, and 10.7% near-duplicates. The new BPE-16K tokenizer reduces token count by 40.2% compared to GPT-4's cl100k_base on FLORES-200 Somali devtest.
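The 40.2% figure is a relative reduction in token count against a baseline tokenizer on the same text. A minimal sketch of that arithmetic follows; the function name and the token counts in the example are illustrative assumptions, not values from the paper.

```python
def token_reduction(baseline_tokens: int, new_tokens: int) -> float:
    """Relative reduction in token count versus a baseline tokenizer.

    Both counts must come from tokenizing the same text.
    """
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 1.0 - new_tokens / baseline_tokens


# Illustrative counts only (assumed, not taken from the paper):
# 1,000 baseline tokens versus 598 tokens with the new tokenizer.
reduction = token_reduction(baseline_tokens=1000, new_tokens=598)
print(f"{reduction:.1%}")
```

A reduction of this size means the same Somali text occupies roughly three fifths of the sequence length it would take under the baseline tokenizer.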
Key takeaway
Research scientists developing Somali natural language processing models should consider integrating SomaliWeb v1 and its BPE-16K tokenizer into their pretraining workflows. The corpus is cleaner than the Somali data in existing multilingual distributions, and the tokenizer encodes Somali text in fewer tokens, which may improve model performance and reduce computational cost.
Key insights
SomaliWeb v1 bundles a quality-filtered Somali corpus, a matched BPE-16K tokenizer, and a language-identification benchmark, filling three gaps in publicly documented Somali resources.
Principles
- Corpus quality impacts downstream model performance.
- Reproducible pipelines enhance data reliability.
Method
A six-stage reproducible pipeline filters and cleans web data from HPLT v2, CC100, and Somali Wikipedia to produce a clean, dedicated Somali corpus.
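The summary does not list the six stages, but it does name two of the defects the pipeline targets: fixable mojibake and byte-exact duplicates. The sketch below shows one common way to handle each, using only the Python standard library. The function names, the cp1252 round-trip heuristic, the SHA-256 hashing, and the order of the two steps are all assumptions for illustration, not the authors' implementation.

```python
import hashlib
from typing import Iterable, List


def repair_mojibake(text: str) -> str:
    """Undo UTF-8 text that was wrongly decoded as cp1252 (assumed heuristic).

    If the round trip fails, the text is returned unchanged.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except UnicodeError:
        # Either the text has characters outside cp1252, or the bytes are
        # not valid UTF-8, so it is not this kind of mojibake.
        return text


def drop_exact_duplicates(docs: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each byte-identical document."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


# Repair first so that a garbled copy and a clean copy of the same
# document collapse into one entry (ordering is an assumption).
raw = ["Soomaaliya", "cafÃ©", "café", "Soomaaliya"]
cleaned = drop_exact_duplicates(repair_mojibake(d) for d in raw)
```

Near-duplicate removal, which the summary also mentions, needs a fuzzy comparison such as shingling or MinHash and is not covered by this sketch.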
In practice
- Use SomaliWeb v1 for Somali NLP pretraining.
- Evaluate language ID tools with the new benchmark.
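Evaluating a language identifier against a labeled benchmark can be sketched as follows. The benchmark's file format and official metrics are not described in this summary, so the `(text, gold_label)` sample layout, the `identify` callable, and the choice of precision, recall, and F1 are assumptions. The label `so` is the ISO 639-1 code for Somali.

```python
from typing import Callable, Dict, Iterable, Tuple


def evaluate_language_id(
    identify: Callable[[str], str],
    samples: Iterable[Tuple[str, str]],
    target: str = "so",
) -> Dict[str, float]:
    """Precision, recall, and F1 for one target language label."""
    tp = fp = fn = 0
    for text, gold in samples:
        pred = identify(text)
        if pred == target and gold == target:
            tp += 1
        elif pred == target:
            fp += 1
        elif gold == target:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy stand-in for a real identifier (hypothetical, for illustration only).
def toy_identifier(text: str) -> str:
    return "so" if "waa" in text.split() else "en"


toy_samples = [
    ("waa maxay", "so"),
    ("hello world", "en"),
    ("waa hello", "en"),
    ("nabad", "so"),
]
scores = evaluate_language_id(toy_identifier, toy_samples)
```

In practice `identify` would wrap whichever production identifier is under test, mapping its output labels to the codes used by the benchmark.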
Topics
- Somali Language Corpus
- BPE Tokenizer
- Language Identification
- Data Quality Analysis
- Natural Language Processing
Best for: Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.