IHUBERT: Vector-Based Semantic Deduplication and Domain-Balanced Pretraining for Persian Resources
Summary
IHUBERT is a new monolingual Persian pretrained language model, built from scratch on the RoBERTa-base encoder architecture (125 million parameters). It was trained on a 45 GB curated subset of the Sepahr-Danesh collection, comprising approximately 7 to 8 billion tokens. The pretraining corpus went through a multi-stage preprocessing pipeline: normalization, exact and near-duplicate removal, anonymization, and vector-database-based semantic deduplication aimed at balancing domains and registers. A custom BPE tokenizer with a 139k vocabulary was also developed to better handle Persian morphology. Evaluated on seven Persian NLU benchmarks, IHUBERT reported significant improvements in extractive question answering, scoring 88.3542 F1 on PQuAD and 49.0987 F1 on ParsiNLU-RC (both on a 0-100 scale), and recorded the best result on FarsTail with a Macro-F1 of 0.8350 (0-1 scale). It remained competitive on NER and topic classification, for example 0.8308 F1 on ParsTwiNER, while relation extraction showed a remaining gap at 0.6684 Macro-F1 on PERLEX.
Key takeaway
For NLP engineers developing Persian language models, IHUBERT suggests that rigorous corpus preprocessing and custom tokenization pay off, most visibly on extractive question answering (PQuAD, ParsiNLU-RC) and on FarsTail. If you are curating pretraining data, prioritize vector-database-based semantic deduplication to reduce redundancy and balance domains. Also consider training a custom BPE tokenizer to better capture your target language's morphology, especially for comprehension-oriented tasks such as question answering. Expect uneven gains: in the reported results, relation extraction on PERLEX still lags.
Key insights
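Vector-based semantic deduplication and domain-balanced pretraining improved Persian language model performance most clearly on extractive question answering and FarsTail, kept it competitive on NER and topic classification, and left a gap on relation extraction.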
Principles
- High-quality, domain-balanced corpora are crucial for pretrained language model (PLM) performance.
- Vector-database semantic deduplication improves corpus quality and distribution.
- Custom BPE tokenizers better capture language-specific morphology.
Method
Implement a multi-stage preprocessing pipeline including normalization, duplicate removal, anonymization, and vector-database semantic deduplication for corpus balancing.
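The pipeline stages above can be sketched in a few lines. This is a minimal illustration under loud assumptions, not the authors' implementation: `normalize`, `embed`, and `deduplicate` are hypothetical names, the hashed character-trigram vector stands in for a real sentence-embedding model, the brute-force scan stands in for a vector-database nearest-neighbour query, and the 0.9 cosine threshold is an arbitrary choice. Anonymization is omitted.

```python
import hashlib
import math
import re
import unicodedata
from collections import Counter


def normalize(text: str) -> str:
    """NFKC-normalize, map Arabic Yeh/Kaf to their Persian forms, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("\u064a", "\u06cc").replace("\u0643", "\u06a9")
    return re.sub(r"\s+", " ", text).strip()


def embed(text: str, dim: int = 1024) -> list[float]:
    """Toy embedding (assumption): hashed character-trigram counts, L2-normalized."""
    counts = Counter(text[i:i + 3] for i in range(max(len(text) - 2, 1)))
    vec = [0.0] * dim
    for gram, n in counts.items():
        bucket = int.from_bytes(hashlib.md5(gram.encode("utf-8")).digest()[:4], "big")
        vec[bucket % dim] += n
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two already L2-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))


def deduplicate(docs: list[str], threshold: float = 0.9) -> list[str]:
    """Keep the first document of every exact or near-duplicate group."""
    kept, kept_vecs, seen_hashes = [], [], set()
    for doc in docs:
        text = normalize(doc)
        # Stage 1: exact duplicates, detected by hashing the normalized text.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        # Stage 2: similarity-based duplicates, detected by cosine similarity.
        # A vector database would answer this query without a full scan.
        vec = embed(text)
        if any(cosine(vec, other) >= threshold for other in kept_vecs):
            continue
        kept.append(text)
        kept_vecs.append(vec)
    return kept
```

Calling `deduplicate` on a list of raw documents returns the normalized survivors in their original order. Because the scan compares each document against everything already kept, the cost grows quadratically; at corpus scale, an approximate nearest-neighbour index is the usual replacement for the `any(...)` loop.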
In practice
- Utilize vector-database semantic deduplication for corpus domain balancing.
- Develop custom BPE tokenizers for morphologically complex languages.
- Broadly evaluate PLMs on comprehension-oriented NLU tasks.
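To make the custom-tokenizer recommendation concrete, here is a toy sketch of how BPE learns merges from a word list. It is an illustration only: `train_bpe`, `merge_pair`, and `encode` are hypothetical helper names, and the sketch works on characters with no byte-level fallback, end-of-word marker, or special tokens. It says nothing about how IHUBERT's 139k-vocabulary tokenizer was actually configured; a production tokenizer would normally be trained with a dedicated tokenizer library.

```python
from collections import Counter


def merge_pair(symbols: tuple, pair: tuple) -> tuple:
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words: list[str], num_merges: int) -> list[tuple]:
    """Learn up to `num_merges` merges by repeatedly merging the most frequent adjacent pair."""
    vocab = Counter(tuple(word) for word in words)  # word as character tuple -> frequency
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Ties are broken lexicographically so training is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word: str, merges: list[tuple]) -> list[str]:
    """Segment a word by applying the learned merges in training order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

Training on a corpus where a suffix recurs across many words makes that suffix an early merge, which is the intuition behind fitting the vocabulary to a language's morphology: frequent stems and affixes end up as single tokens instead of being split into characters.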
Topics
- Persian Language Models
- Semantic Deduplication
- Pretraining Corpora
- NLU Benchmarks
- BPE Tokenization
- IHUBERT
Best for: Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.