IHUBERT: Vector-Based Semantic Deduplication and Domain-Balanced Pretraining for Persian Resources

2026-06-19 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

IHUBERT is a new monolingual Persian pretrained language model, built from scratch on the RoBERTa-base encoder architecture (125 million parameters). It was pretrained on a 45 GB curated subset of the Sepahr-Danesh collection, comprising approximately 7 to 8 billion tokens. The pretraining corpus went through a multi-stage preprocessing pipeline: normalization, exact and near-duplicate removal, anonymization, and vector-database-based semantic deduplication, the last of which is used to balance domains and registers. A custom BPE tokenizer with a 139k vocabulary was also developed to better handle Persian morphology. Evaluated across seven Persian NLU benchmarks, IHUBERT achieved significant improvements in extractive question answering, scoring 88.3542 F1 on PQuAD and 49.0987 F1 on ParsiNLU-RC, and recorded the best result on FarsTail with a Macro-F1 of 0.8350. It remained competitive on NER and topic classification, for example 0.8308 F1 on ParsTwiNER, while relation extraction showed a remaining gap at 0.6684 Macro-F1 on PERLEX.

Key takeaway

For NLP engineers developing Persian language models, IHUBERT demonstrates that investing in rigorous corpus preprocessing and custom tokenization yields substantial performance gains. If you are curating pretraining data, prioritize vector-database-based semantic deduplication to reduce redundancy and balance domains. Also consider training a custom BPE tokenizer that captures your target language's morphology, especially for comprehension-oriented tasks such as question answering, where IHUBERT's gains were largest.

Key insights

Vector-based semantic deduplication and domain-balanced pretraining significantly enhance Persian language model performance across diverse NLU tasks.

Principles

High-quality, domain-balanced corpora are crucial for pretrained language model (PLM) performance.
Vector-database semantic deduplication improves corpus quality and distribution.
Custom BPE tokenizers better capture language-specific morphology.

Method

Implement a multi-stage preprocessing pipeline including normalization, duplicate removal, anonymization, and vector-database semantic deduplication for corpus balancing.
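
The pipeline stages named above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not specify the normalization rules, the anonymization scope, the embedding model, or the similarity threshold. Here `normalize`, `anonymize`, `embed`, `cosine`, and `deduplicate` are hypothetical helpers, the "embedding" is a toy bag of character trigrams standing in for a neural sentence encoder, and a linear scan stands in for the nearest-neighbor lookup a vector database would provide.

```python
import hashlib
import math
import re
from collections import Counter

# Common Persian normalization step: map Arabic yeh/kaf to their Persian forms.
# (Assumption: the paper's exact normalization rules are not given in the summary.)
_CHAR_MAP = str.maketrans({"\u064a": "\u06cc", "\u0643": "\u06a9"})
_EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def normalize(text: str) -> str:
    """Unify character variants and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.translate(_CHAR_MAP)).strip()


def anonymize(text: str) -> str:
    """Mask e-mail addresses (a stand-in for a fuller PII pass)."""
    return _EMAIL.sub("<EMAIL>", text)


def embed(text: str, n: int = 3) -> Counter:
    """Toy embedding: bag of character n-grams (assumption, not a real encoder)."""
    return Counter(text[i:i + n] for i in range(max(len(text) - n + 1, 1)))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def deduplicate(docs, threshold=0.8):
    """Exact dedup by hash, then greedy similarity dedup against kept documents."""
    seen_hashes, kept, kept_vecs = set(), [], []
    for raw in docs:
        text = anonymize(normalize(raw))
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        seen_hashes.add(digest)
        vec = embed(text)
        # Linear scan; a vector database would answer this with an index lookup.
        if any(cosine(vec, other) >= threshold for other in kept_vecs):
            continue  # near or semantic duplicate of a document already kept
        kept.append(text)
        kept_vecs.append(vec)
    return kept
```

The threshold of 0.8 is an arbitrary illustrative value. With real sentence embeddings it would need tuning per corpus, and balancing domains would additionally require per-domain bookkeeping that this sketch omits.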

In practice

Utilize vector-database semantic deduplication for corpus domain balancing.
Develop custom BPE tokenizers for morphologically complex languages.
Broadly evaluate PLMs on comprehension-oriented NLU tasks.
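
To make the tokenizer recommendation concrete, the following toy sketch shows how BPE learns merges from word frequencies, which is why frequent affixes in a morphologically rich language end up as single tokens. It is an assumption-laden illustration of the classic BPE merge procedure on an ASCII corpus, not the IHUBERT tokenizer: `learn_bpe`, `merge_pair`, and `encode` are hypothetical names, and a 139k-vocabulary tokenizer would be trained with an optimized library on the full corpus.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Learn an ordered list of merges from a word -> frequency mapping."""
    # Each word starts as a tuple of single characters.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def encode(word, merges):
    """Apply the learned merges, in order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus: the shared suffix "est" is frequent enough to become one token.
merges = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=2)
tokens = encode("best", merges)
```

In this toy run the suffix shared by "newest" and "widest" is merged first, so an unseen word with the same suffix is split into a stem and that suffix. The same mechanism, trained on Persian text, is what lets a language-specific vocabulary represent common affixes compactly.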

Topics

Persian Language Models
Semantic Deduplication
Pretraining Corpora
NLU Benchmarks
BPE Tokenization
IHUBERT

Best for: Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.