IHUBERT: Vector-Based Semantic Deduplication and Domain-Balanced Pretraining for Persian Resources

Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Natural Language Processing · Depth: Expert, quick

Summary

IHUBERT is a monolingual Persian pretrained language model (PLM) built on a RoBERTa-base encoder with 125M parameters. It was trained from scratch on a 45 GB curated subset of the Sepahr-Danesh collection, comprising approximately 7-8 billion tokens. The corpus passed through a multi-stage preprocessing pipeline: normalization, exact and near-duplicate removal, anonymization, and vector-database-based semantic deduplication, the last of which is used to balance the domain distribution. A custom BPE tokenizer with a 139k vocabulary was trained to better capture Persian morphology. Evaluated on seven Persian NLU benchmarks, IHUBERT shows its strongest gains on extractive QA, ranking first on PQuAD (F1 88.3542) and ParsiNLU-RC (F1 49.0987), and also achieves the best result on FarsTail (Macro-F1 0.8350). It remains competitive on NER and topic classification; relation extraction is identified as a remaining gap.

Key takeaway

For NLP engineers building language models for low-resource or morphologically rich languages, this work highlights the impact of careful corpus curation. Consider adopting multi-stage preprocessing, including semantic deduplication and domain balancing, alongside a custom BPE tokenizer. As IHUBERT's results indicate, this combination can improve model performance, particularly on comprehension-oriented tasks such as extractive question answering.
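To make the tokenization part of this advice concrete, here is a minimal sketch of how BPE learns merges from a word list. It is a toy, character-level illustration of the general algorithm, not IHUBERT's tokenizer: the summary does not say which BPE variant or library was used, and the function names (learn_bpe, merge_pair, encode) are hypothetical. RoBERTa-style models usually rely on byte-level BPE trained with a dedicated library.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges from a list of words (toy, character-level)."""
    # Each distinct word is stored as a tuple of symbols with its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by the pair itself so runs are deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

Training on a corpus in the target language lets frequent stems and affixes become single tokens, which is the intuition behind using a custom vocabulary for a morphologically rich language.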

Key insights

Semantic deduplication and domain-balanced pretraining improve Persian PLM performance on most of the evaluated NLU tasks, with the largest gains on extractive QA; relation extraction remains a gap.

Method

The corpus is cleaned with a multi-stage preprocessing pipeline: normalization, exact and near-duplicate removal, anonymization, and vector-database-based semantic deduplication, which also balances the domain distribution.
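The stages listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary gives no normalization rules, anonymization patterns, embedding model, vector database, or similarity threshold, so every function name, the email pattern, and the 0.9 threshold are hypothetical. The brute-force cosine comparison stands in for a vector-database nearest-neighbour query, and near-duplicate removal (commonly done with MinHash-style hashing) is omitted for brevity.

```python
import hashlib
import math
import re
import unicodedata


def normalize(text):
    """Apply Unicode NFC, map Arabic yeh/kaf to their Persian forms, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = text.replace("\u064a", "\u06cc").replace("\u0643", "\u06a9")
    return re.sub(r"\s+", " ", text).strip()


def anonymize(text):
    """Mask email addresses (assumed pattern; real pipelines cover more identifier types)."""
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", "<EMAIL>", text)


def exact_dedup(docs):
    """Keep the first occurrence of each document, compared by SHA-256 digest."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_dedup(docs, embed, threshold=0.9):
    """Greedily keep a document only if it is not too similar to any kept document.

    `embed` is a caller-supplied function mapping text to a vector. The loop is
    O(n^2); a vector database would replace it with an approximate neighbour search.
    """
    kept, kept_vecs = [], []
    for doc in docs:
        vec = embed(doc)
        if all(cosine(vec, other) < threshold for other in kept_vecs):
            kept.append(doc)
            kept_vecs.append(vec)
    return kept


def run_pipeline(docs, embed, threshold=0.9):
    """Chain the stages: normalize, anonymize, exact dedup, then semantic dedup."""
    cleaned = [anonymize(normalize(doc)) for doc in docs]
    cleaned = [doc for doc in cleaned if doc]  # drop documents that became empty
    return semantic_dedup(exact_dedup(cleaned), embed, threshold)
```

Because over-represented domains tend to contain many semantically similar documents, thresholded removal of this kind thins them more than sparse domains, which is one way semantic deduplication can flatten a domain distribution. The threshold controls how aggressive that thinning is and would need tuning on the actual corpus.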

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.