MiqraBERT: Regression-Based Sentence-BERT Finetuning for Biblical Hebrew Parallel Detection
Summary
MiqraBERT is a Sentence-BERT model finetuned from AlephBERT to detect verse-level semantic similarity in Biblical Hebrew. It addresses a limitation of traditional lexical overlap methods, which struggle with paraphrased or syntactically reworked textual reuse. Trained on 1,650 labeled verse and half-verse pairs (825 true parallels and 825 randomly sampled negatives), MiqraBERT uses cosine-similarity regression to learn an embedding space in which parallel verses cluster. Evaluation with Wasserstein distance and the overlap coefficient across ten random seeds shows a 2.7-fold improvement in distributional separation over the pre-trained baseline, reducing ambiguous overlap from approximately 24% to about 6%. The model reaches 87.1% recall@10 for narrative synoptic parallels, but recall for poetic parallels remains below 9%, so it is reliable only for narrative textual reuse. MiqraBERT is publicly available on Hugging Face.
Key takeaway
If you are an NLP engineer building textual reuse detection for ancient or low-resource languages, lexical overlap methods are often inadequate on their own. MiqraBERT shows that finetuning a BERT-based encoder with cosine-similarity regression can substantially improve detection, particularly for narrative texts. Consider adopting this methodology, evaluate performance separately for each genre, and try the publicly available MiqraBERT model for Biblical Hebrew applications.
Key insights
Finetuning a BERT model with cosine-similarity regression detects textual reuse in Biblical Hebrew that lexical overlap misses, although the gains reported here are confined to narrative text.
Principles
- Lexical overlap methods struggle with paraphrased or syntactically reworked reuse.
- Cosine-similarity regression reshapes the embedding space so that parallels cluster.
- Genre significantly impacts detection accuracy.
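The genre effect only becomes visible when retrieval quality is reported per genre rather than as a single pooled number. The following is a minimal sketch of a per-genre recall@k computation. The function names, the toy identifiers, and the assumption of exactly one gold parallel per query are illustrative and are not taken from the paper, whose exact recall@10 protocol is not described in this summary.

```python
def recall_at_k(rankings, gold, k=10):
    """Fraction of queries whose gold parallel appears among the top-k candidates.

    rankings: dict mapping query id -> list of candidate ids, best first
    gold:     dict mapping query id -> id of its true parallel (assumed single)
    """
    hits = sum(1 for query, ranked in rankings.items() if gold[query] in ranked[:k])
    return hits / len(rankings)


def recall_by_genre(rankings, gold, genre_of, k=10):
    """Compute recall@k separately for each genre label."""
    grouped = {}
    for query in rankings:
        grouped.setdefault(genre_of[query], {})[query] = rankings[query]
    return {genre: recall_at_k(subset, gold, k) for genre, subset in grouped.items()}


# Toy data with hypothetical ids; k=2 keeps the example small.
toy_rankings = {
    "n1": ["a", "b", "c"],
    "n2": ["d", "e", "f"],
    "p1": ["g", "h", "i"],
}
toy_gold = {"n1": "b", "n2": "d", "p1": "i"}
toy_genres = {"n1": "narrative", "n2": "narrative", "p1": "poetry"}

per_genre = recall_by_genre(toy_rankings, toy_gold, toy_genres, k=2)
print(per_genre)
```

Reporting the result as a dictionary keyed by genre makes a gap like the one described above (strong narrative recall, weak poetic recall) impossible to hide behind an average.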
Method
Finetune a pre-trained encoder using cosine-similarity regression on balanced true/negative parallel pairs to learn a semantic embedding space.
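As a rough illustration of the objective, the sketch below computes a cosine-similarity regression loss over labeled pairs in plain Python. It assumes target labels of 1.0 for true parallels and 0.0 for negatives and a mean-squared-error penalty, which is the usual Sentence-BERT formulation. The summary does not state the exact label values or loss used for MiqraBERT, and the two-dimensional vectors stand in for real encoder outputs.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_regression_loss(pairs):
    """Mean squared error between predicted cosine similarity and target label.

    pairs: list of (embedding_a, embedding_b, label) tuples, where label is
    assumed to be 1.0 for a true parallel and 0.0 for a random negative.
    """
    squared_errors = [(cosine_similarity(u, v) - label) ** 2 for u, v, label in pairs]
    return sum(squared_errors) / len(squared_errors)


# Toy embeddings (hypothetical values, not model outputs).
toy_pairs = [
    ([1.0, 0.0], [1.0, 0.0], 1.0),  # parallel embedded identically: no error
    ([1.0, 0.0], [0.0, 1.0], 0.0),  # negative embedded orthogonally: no error
    ([1.0, 0.0], [1.0, 0.0], 0.0),  # negative embedded identically: squared error of 1
]
loss = cosine_regression_loss(toy_pairs)
print(round(loss, 4))
```

During finetuning, gradients of this loss with respect to the encoder weights pull parallel pairs together and push negative pairs apart. In a real setup the embeddings would come from the encoder being trained rather than from fixed lists.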
In practice
- Apply MiqraBERT for Biblical Hebrew narrative reuse.
- Tailor NLP models to specific text genres.
- Use distributional metrics for semantic evaluation.
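To make the distributional metrics concrete, here is a minimal sketch that compares the cosine-similarity scores of parallel pairs against those of negative pairs. It assumes equal-sized samples for the one-dimensional Wasserstein distance and a fixed-bin histogram estimate of the overlap coefficient. The bin count and score range are illustrative choices, and the paper's exact estimators may differ.

```python
def wasserstein_1d(xs, ys):
    """1-D Wasserstein-1 distance between two equal-sized samples.

    For equal-sized samples this is the mean absolute difference
    between the sorted values.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes samples of equal size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def overlap_coefficient(xs, ys, bins=20, lo=-1.0, hi=1.0):
    """Histogram estimate of the shared area under two score distributions.

    Returns a value between 0.0 (fully separated) and 1.0 (identical).
    """
    def histogram(values):
        counts = [0] * bins
        for value in values:
            index = int((value - lo) / (hi - lo) * bins)
            index = max(0, min(index, bins - 1))  # clamp edge values into range
            counts[index] += 1
        return [count / len(values) for count in counts]

    p, q = histogram(xs), histogram(ys)
    return sum(min(a, b) for a, b in zip(p, q))


# Toy similarity scores (hypothetical, not results from the paper).
parallel_scores = [0.9, 0.8, 0.85, 0.7]
negative_scores = [0.1, 0.2, 0.15, 0.3]

separation = wasserstein_1d(parallel_scores, negative_scores)
overlap = overlap_coefficient(parallel_scores, negative_scores)
print(round(separation, 3), overlap)
```

A larger Wasserstein distance and a smaller overlap coefficient both indicate that parallels and negatives are easier to tell apart, which is the sense in which the summary reports improved separation and reduced ambiguous overlap.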
Topics
- MiqraBERT
- Sentence-BERT
- Biblical Hebrew
- Textual Reuse Detection
- Semantic Similarity
- Regression Finetuning
Best for: NLP Engineer, AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.