Designing a Synthetic Data Pipeline for Persian LLM Fine Tuning

2026-06-22 · Source: LLM on Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Software Development & Engineering · Depth: Intermediate, short

Summary

This project introduces a comprehensive synthetic data pipeline designed to address the scarcity and quality issues of instruction datasets for low-resource languages like Persian. The system, which includes structured topic generation, multi-layer filtering, and QLoRA fine-tuning, aims to improve Large Language Model (LLM) performance. It utilizes a topic tree with 51 domains and approximately 350 subtopics for controlled diversity, generating around 4,000 instruction pairs. Data undergoes semantic deduplication using embedding similarity (threshold > 0.75) and LLM-based quality scoring for fluency, relevance, and completeness, retaining samples with an average score above 3.5 out of 5. The curated dataset is then used to fine-tune Qwen2.5 3B Instruct via QLoRA on Google Colab T4 over 3 epochs. Evaluation demonstrates that the fine-tuned model produces fluent, consistent Persian output with improved instruction adherence, contrasting with the base model's issues. This highlights data engineering's critical role over model scaling.

Key takeaway

For AI Engineers fine-tuning LLMs in low-resource languages like Persian, prioritize data engineering over model size. Your focus should be on building robust synthetic data pipelines that incorporate structured topic generation, semantic deduplication, and LLM-based quality scoring. This approach, demonstrated to yield significant performance improvements with just 4,000 curated samples, will enable your models to achieve fluent, instruction-following outputs without relying on massive datasets or larger base models.

Key insights

Data quality and engineering are paramount for LLM performance, especially in low-resource languages.

Principles

Data quality is the primary bottleneck for LLM performance.
Dual filtering (semantic, LLM-based) is essential for dataset quality.
Structured topic graphs ensure better coverage and diversity.

Method

The pipeline involves structured topic tree generation, LLM-based data creation, semantic deduplication, LLM-as-a-judge quality scoring, dataset export, and QLoRA fine-tuning.

In practice

Use topic trees for controlled diversity in synthetic data generation.
Implement embedding-based semantic deduplication (e.g., similarity > 0.75).
Employ a second LLM for automated quality scoring (fluency, relevance, completeness).

Topics

Persian LLM Fine-tuning
Synthetic Data Generation
Low-Resource Languages
Data Quality Engineering
QLoRA
Instruction Following Models

Code references

MohammadHeydari/FarsiSyntheticData

Best for: AI Engineer, Machine Learning Engineer, NLP Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by LLM on Medium.