SomaliWeb v1: A Quality-Filtered Somali Web Corpus with a Matched Tokenizer and a Public Language-Identification Benchmark

Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, quick

Summary

SomaliWeb v1 is a newly released, quality-filtered Somali web corpus comprising 819,322 documents and approximately 303 million tokens. It was built from HPLT v2, CC100, and Somali Wikipedia using a reproducible six-stage pipeline, and it addresses the lack of publicly documented, dedicated Somali pretraining corpora, companion tokenizers, and language-identification benchmarks. Alongside the corpus, the release includes a matched BPE-16K tokenizer and the first public language-identification benchmark covering three production language identifiers. Analysis of existing distributions revealed significant quality issues: HPLT v2's "cleaned" Somali data contained 17.3% byte-exact duplicates, 56.1% fixable mojibake, and 10.7% near-duplicates. On the FLORES-200 Somali devtest set, the new BPE-16K tokenizer produces 40.2% fewer tokens than GPT-4's cl100k_base.
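The 40.2% figure is a relative reduction in token count. As a rough illustration of how such a number can be computed, the sketch below assumes the reduction is measured over total token counts produced by two tokenizers on the same sentences. The function name and the counts are hypothetical and are not taken from the release.

```python
def token_reduction(baseline_counts, candidate_counts):
    """Percent reduction in total tokens of a candidate tokenizer vs. a baseline.

    Both lists hold per-sentence token counts for the same sentences.
    """
    if len(baseline_counts) != len(candidate_counts):
        raise ValueError("both tokenizers must be run on the same sentences")
    baseline_total = sum(baseline_counts)
    if baseline_total == 0:
        raise ValueError("baseline token total must be positive")
    candidate_total = sum(candidate_counts)
    return 100.0 * (baseline_total - candidate_total) / baseline_total


# Illustrative counts only (not measurements from SomaliWeb v1 or FLORES-200).
baseline = [50, 30, 20]   # e.g. counts from a general-purpose tokenizer
candidate = [30, 18, 12]  # e.g. counts from a language-matched tokenizer
print(f"{token_reduction(baseline, candidate):.1f}% fewer tokens")
```

Summing before dividing weights long sentences more heavily than averaging per-sentence ratios would; which convention the original article uses is not stated in this summary.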

Key takeaway

Research scientists developing Somali natural language processing models should consider integrating SomaliWeb v1 and its BPE-16K tokenizer into their pretraining workflows. The corpus is cleaner than the Somali data in existing multilingual distributions, and the matched tokenizer encodes Somali text in fewer tokens, which may improve model performance and reduce computational cost.

Key insights

SomaliWeb v1 provides a high-quality Somali corpus, tokenizer, and language-ID benchmark, addressing critical resource gaps.

Principles

Method

A six-stage pipeline filters and processes web data from multiple sources to create a clean, dedicated language corpus.
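This summary does not enumerate the six stages. As a minimal sketch of what two such stages might look like, the code below repairs one common kind of mojibake (UTF-8 text that was mis-decoded as Latin-1) and then drops byte-exact duplicates by hashing. The function names, the stage order, and the toy inputs are assumptions made for illustration, not the authors' implementation.

```python
import hashlib


def repair_mojibake(text):
    """Undo UTF-8-read-as-Latin-1 mojibake; return the text unchanged otherwise."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except UnicodeError:
        # Either the text has characters outside Latin-1, or the bytes are
        # not valid UTF-8, so this heuristic does not apply.
        return text


def drop_exact_duplicates(docs):
    """Yield each document once, keeping the first byte-exact occurrence."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield doc


def clean(docs):
    """Hypothetical two-stage pass: repair first, then deduplicate."""
    repaired = (repair_mojibake(d) for d in docs)
    return list(drop_exact_duplicates(repaired))


# Toy input: a mojibake copy and a clean copy of the same string, plus a repeat.
raw = ["doc one", "caf\u00c3\u00a9", "caf\u00e9", "doc one"]
print(clean(raw))
```

Repairing before deduplicating lets a garbled copy and a clean copy of the same document collapse into one. The Latin-1 round trip is only a heuristic and can, in rare cases, alter legitimate text, so a production pipeline would likely need stricter checks.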

Best for: Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.