IHUBERT: Vector-Based Semantic Deduplication and Domain-Balanced Pretraining for Persian Resources
Summary
IHUBERT is a monolingual Persian pretrained language model (PLM) built on a RoBERTa-base encoder with 125M parameters. It was trained from scratch on a 45 GB curated subset of the Sepahr-Danesh collection, comprising approximately 7-8 billion tokens. Corpus quality is enforced by a multi-stage preprocessing pipeline: normalization, exact and near-duplicate removal, anonymization, and vector-database-based semantic deduplication that also balances the domain distribution. A custom BPE tokenizer with a 139k vocabulary was trained to better capture Persian morphology. Across seven Persian NLU benchmarks, IHUBERT achieves strong gains on extractive QA, ranking first on PQuAD (F1 88.3542) and ParsiNLU-RC (F1 49.0987), and it also obtains the best result on FarsTail (Macro-F1 0.8350). It remains competitive on NER and topic classification, while relation extraction is identified as a remaining gap.
Key takeaway
For NLP engineers building language models for low-resource or morphologically rich languages, this work highlights the impact of careful corpus curation. Consider combining multi-stage preprocessing, including semantic deduplication and domain balancing, with a custom BPE tokenizer. As IHUBERT demonstrates, this approach can substantially improve model performance, particularly on comprehension-oriented tasks such as extractive question answering.
Key insights
Semantic deduplication and domain-balanced pretraining improve Persian PLM performance on NLU benchmarks, most clearly on extractive QA and FarsTail, while relation extraction remains a gap.
Principles
- Corpus quality and redundancy reduction are crucial for PLM pretraining.
- Vector-database-based semantic deduplication can balance domain distribution.
- Custom BPE tokenizers can better capture language-specific morphology.
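To make the tokenizer principle concrete, here is a minimal sketch of how BPE learns merges from raw text. It is a toy illustration, not IHUBERT's tokenizer: the function names (`merge_pair`, `learn_bpe`, `segment`) and the three-word corpus are assumptions made for this example, and a real 139k-vocabulary tokenizer would be trained with a dedicated library on the full corpus.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges, starting from single characters."""
    # Each distinct word is stored as a tuple of symbols with its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair (ties go to the first pair seen).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(merge_pair(symbols, best))] += freq
        vocab = new_vocab
    return merges

def segment(word, merges):
    """Apply the learned merges, in order, to a single word."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

# Hypothetical toy corpus: three word forms sharing the stem "کتاب" (book).
toy_corpus = ["کتاب", "کتابها", "کتابی"]
merges = learn_bpe(toy_corpus, num_merges=3)
pieces = segment(toy_corpus[1], merges)
print(pieces)
```

With only three merges the shared stem becomes a single token and the suffix characters stay separate, which is the kind of stem-plus-affix segmentation a language-specific vocabulary is meant to encourage.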
Method
The corpus passes through a multi-stage preprocessing pipeline: normalization, exact and near-duplicate removal, anonymization, and vector-database-based semantic deduplication, which is also used to balance the domain distribution.
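The pipeline can be sketched in a few lines to show how the stages fit together. This is a simplified illustration under several assumptions, none of which come from the source: the normalization rules (mapping Arabic Yeh and Kaf to their Persian forms), the `toy_embed` character n-gram vectors standing in for a real sentence encoder, the linear scan standing in for a vector database, and the 0.8 similarity threshold. Anonymization and domain balancing are omitted.

```python
import hashlib
import math
import re
from collections import Counter

# Assumption: map Arabic Yeh (U+064A) and Kaf (U+0643) to their Persian forms.
_CHAR_MAP = str.maketrans({"\u064a": "\u06cc", "\u0643": "\u06a9"})

def normalize(text):
    """Unify character variants and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.translate(_CHAR_MAP)).strip()

def toy_embed(text, n=3):
    """Stand-in for a sentence encoder: sparse character n-gram counts."""
    return Counter(text[i:i + n] for i in range(max(len(text) - n + 1, 1)))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def deduplicate(docs, threshold=0.8):
    """Drop exact duplicates by hash, then documents too similar to a kept one."""
    seen_hashes, kept, kept_vecs = set(), [], []
    for doc in docs:
        text = normalize(doc)
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        seen_hashes.add(digest)
        vec = toy_embed(text)
        # Linear scan; a vector database would answer this query at scale.
        if any(cosine(vec, other) >= threshold for other in kept_vecs):
            continue  # similar to a document that was already kept
        kept.append(text)
        kept_vecs.append(vec)
    return kept

docs = [
    "the cat sat on the mat",
    "the cat  sat on the mat",   # exact duplicate once whitespace is collapsed
    "the cat sat on the mat!",   # near duplicate
    "completely different text here",
]
kept = deduplicate(docs)
print(kept)
```

The greedy keep-first strategy is order dependent, so a production pipeline would likely need a more deliberate rule for choosing which member of a duplicate cluster survives.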
In practice
- Implement vector-based semantic deduplication for corpus curation.
- Train custom BPE tokenizers for morphologically rich languages.
- Evaluate PLMs on comprehension-oriented tasks like QA.
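For the evaluation point, extractive QA is commonly scored with a token-overlap F1 between the predicted and reference answer spans, in the style of the SQuAD metric. The sketch below assumes that style of scoring; the source does not state how the PQuAD or ParsiNLU-RC scores were computed, and the whitespace tokenization here is a simplification (official scripts usually also normalize punctuation). The name `token_f1` is hypothetical.

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted and a reference answer string."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# A prediction with one extra token: precision 2/3, recall 1.
score = token_f1("a b c", "a b")
print(round(score, 4))
```

Averaging this score over all questions, and taking the maximum over references when a question has several gold answers, gives a dataset-level F1.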
Topics
- Persian Language Models
- Semantic Deduplication
- Pretraining Corpora
- RoBERTa
- BPE Tokenization
- Natural Language Understanding
- Extractive Question Answering
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.