IHUBERT: Vector-Based Semantic Deduplication and Domain-Balanced Pretraining for Persian Resources

2026-06-18 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Natural Language Processing · Depth: Expert, quick

Summary

IHUBERT is a monolingual Persian pretrained language model (PLM) built on a RoBERTa-base encoder with 125M parameters. It was trained from scratch on a 45 GB curated subset of the Sepahr-Danesh collection, roughly 7 to 8 billion tokens. Corpus quality is handled by a multi-stage preprocessing pipeline: normalization, exact and near-duplicate removal, anonymization, and vector-database-based semantic deduplication, which is also used to balance the domain distribution. A custom BPE tokenizer with a 139k vocabulary was trained to better capture Persian morphology. Evaluated on seven Persian NLU benchmarks, IHUBERT shows strong gains on extractive QA, ranking first on PQuAD (F1 88.3542) and ParsiNLU-RC (F1 49.0987), and it achieves the best result on FarsTail (Macro-F1 0.8350). It remains competitive on NER and topic classification; relation extraction is identified as a remaining gap.

Key takeaway

For NLP engineers building language models for low-resource or morphologically rich languages, this work underlines how much careful corpus curation matters. Consider adopting multi-stage preprocessing, including semantic deduplication and domain balancing, alongside a custom BPE tokenizer. As IHUBERT's results suggest, this approach can improve model performance, particularly on comprehension-oriented tasks such as extractive question answering.

Key insights

Semantic deduplication and domain-balanced pretraining improve Persian PLM performance, most clearly on extractive QA (PQuAD, ParsiNLU-RC) and on FarsTail; relation extraction remains a gap.

Principles

Corpus quality and redundancy reduction are crucial for PLM pretraining.
Vector-database-based semantic deduplication can balance domain distribution.
Custom BPE tokenizers can better capture language-specific morphology.
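
To make the tokenizer principle concrete, here is a minimal sketch of how BPE learns merges from word frequencies. It is a character-level toy written for readability, not IHUBERT's tokenizer: the function names (`learn_bpe`, `merge_pair`, `encode`), the `</w>` end-of-word marker, and the toy corpus are all assumptions made for illustration, and this summary does not say whether the 139k-vocabulary tokenizer is byte-level or character-level.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges from a {word: frequency} dict.

    Returns the merges in the order they were learned.
    """
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def encode(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus: frequent substrings end up as single tokens.
toy_merges = learn_bpe({"ab": 5, "abab": 3}, num_merges=10)
```

Because Python strings are Unicode, the same functions accept Persian input, for example `learn_bpe({"کتاب": 3, "کتابها": 5, "درختها": 4}, 20)`. With enough merges, recurring units such as a shared plural suffix tend to become single tokens, which is the intuition behind training a language-specific vocabulary.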

Method

A multi-stage preprocessing pipeline involving normalization, exact/near-duplicate removal, anonymization, and vector-database-based semantic deduplication for distribution balancing.
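
A rough sketch of three of these stages (normalization, exact-duplicate removal, and vector-based semantic deduplication) is given below. Everything in it is an assumption for illustration: the function names, the 0.95 cosine threshold, the two-character normalization map, and the hand-written vectors, which stand in for embeddings from some encoder. A plain Python list stands in for the vector database, near-duplicate removal and anonymization are omitted, and this summary does not describe how the authors couple deduplication to domain balancing. The sketch only shows that a domain with many redundant documents shrinks more than a sparse one.

```python
import hashlib
import math
import re
import unicodedata
from collections import Counter

# Assumption: map Arabic yeh/kaf code points to their Persian counterparts.
CHAR_TABLE = str.maketrans({"\u064a": "\u06cc", "\u0643": "\u06a9"})


def normalize(text):
    """NFC-normalize, unify yeh/kaf variants, and collapse whitespace runs."""
    text = unicodedata.normalize("NFC", text).translate(CHAR_TABLE)
    return re.sub(r"\s+", " ", text).strip()


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def curate(docs, sim_threshold=0.95):
    """Filter a list of (text, domain, vector) triples.

    A document is dropped if its normalized text was already seen, or if its
    vector is at least `sim_threshold` similar to a document already kept.
    """
    seen_hashes = set()
    kept = []  # in-memory stand-in for a vector database index
    for text, domain, vec in docs:
        text = normalize(text)
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        seen_hashes.add(digest)
        if any(cosine(vec, kept_vec) >= sim_threshold for _, _, kept_vec in kept):
            continue  # semantic duplicate of a kept document
        kept.append((text, domain, vec))
    return kept


def domain_counts(docs):
    """Count documents per domain label."""
    return Counter(domain for _, domain, _ in docs)


# Hypothetical corpus: the "news" domain is over-represented and redundant.
corpus = [
    ("news a", "news", [1.0, 0.0, 0.0]),
    ("news  a", "news", [1.0, 0.0, 0.0]),            # exact duplicate once normalized
    ("news a reworded", "news", [0.99, 0.05, 0.0]),  # semantically close to "news a"
    ("news b", "news", [0.0, 1.0, 0.0]),
    ("poem", "literature", [0.0, 0.0, 1.0]),
]
curated = curate(corpus)
```

Comparing `domain_counts(corpus)` with `domain_counts(curated)` shows the redundant domain losing documents while the sparse one is untouched. The linear scan over `kept` is quadratic overall; at the scale of a 45 GB corpus an approximate nearest-neighbour index would be needed instead.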

In practice

Implement vector-based semantic deduplication for corpus curation.
Train custom BPE tokenizers for morphologically rich languages.
Evaluate PLMs on comprehension-oriented tasks like QA.
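
For the QA evaluation point, the F1 scores quoted for extractive QA are typically token-overlap scores in the style of the SQuAD evaluation. The sketch below assumes whitespace tokenization and no answer normalization; whether the official PQuAD or ParsiNLU-RC scripts match it exactly is not stated in this summary, and the function names are illustrative.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted answer span and one reference answer."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def corpus_f1(predictions, references_per_question):
    """Mean over questions of the best F1 against any reference, on a 0-100 scale."""
    scores = [
        max(token_f1(pred, ref) for ref in refs)
        for pred, refs in zip(predictions, references_per_question)
    ]
    return 100.0 * sum(scores) / len(scores)
```

For instance, a prediction of three tokens that contains both tokens of a two-token reference has precision 2/3 and recall 1, giving F1 = 0.8. Partial credit of this kind is why F1 is usually reported alongside exact match for extractive QA.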

Topics

Persian Language Models
Semantic Deduplication
Pretraining Corpora
RoBERTa
BPE Tokenization
Natural Language Understanding
Extractive Question Answering

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.