MiqraBERT: Regression-Based Sentence-BERT Finetuning for Biblical Hebrew Parallel Detection

· Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, quick

Summary

MiqraBERT is a Sentence-BERT model finetuned from AlephBERT to detect verse-level semantic similarity in Biblical Hebrew. It addresses a limitation of traditional lexical-overlap methods, which struggle when textual reuse is paraphrased or syntactically reworked. Trained on 1,650 labeled verse and half-verse pairs (825 true parallels and 825 randomly sampled negatives), MiqraBERT uses cosine-similarity regression to learn an embedding space in which parallel verses cluster. Evaluated with Wasserstein distance and the overlap coefficient across ten random seeds, it shows a 2.7-fold improvement in distributional separation over the pre-trained baseline and reduces ambiguous overlap from approximately 24% to about 6%. It achieves 87.1% recall@10 on narrative synoptic parallels, but its performance on poetic parallels remains below 9%, so it is reliable only for narrative textual reuse. MiqraBERT is publicly available on Hugging Face.
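The two separation metrics named above can be illustrated with a minimal sketch. It assumes the metrics are computed over the cosine-similarity scores of parallel and non-parallel pairs, and that the overlap coefficient is estimated from normalized histograms; the summary does not specify either detail. The function names, the bin count, and the toy scores are hypothetical and are not results from the paper.

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two empirical 1-D samples.

    Computed as the area between the two empirical CDFs.
    """
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs) | set(ys))
    total = 0.0
    i = j = 0
    for left, right in zip(points, points[1:]):
        # Advance the counters so i/len(xs) and j/len(ys) are the CDFs at `left`.
        while i < len(xs) and xs[i] <= left:
            i += 1
        while j < len(ys) and ys[j] <= left:
            j += 1
        total += abs(i / len(xs) - j / len(ys)) * (right - left)
    return total


def overlap_coefficient(xs, ys, bins=20, lo=-1.0, hi=1.0):
    """Histogram estimate of the shared area of two score distributions.

    Assumption: scores are cosine similarities in [lo, hi]; the bin count
    is an arbitrary illustrative choice.
    """
    def hist(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / (hi - lo) * bins)
            idx = max(0, min(idx, bins - 1))  # clamp edge values into range
            counts[idx] += 1
        return [c / len(values) for c in counts]

    p, q = hist(xs), hist(ys)
    return sum(min(a, b) for a, b in zip(p, q))


# Toy similarity scores (hypothetical, for illustration only).
parallel_scores = [0.82, 0.90, 0.77, 0.95, 0.68]
negative_scores = [0.10, 0.25, 0.05, 0.40, 0.70]

print("Wasserstein distance:", wasserstein_1d(parallel_scores, negative_scores))
print("Overlap coefficient:", overlap_coefficient(parallel_scores, negative_scores))
```

A larger Wasserstein distance and a smaller overlap coefficient both indicate that the scores of parallels and negatives are more cleanly separated.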

Key takeaway

NLP engineers building textual-reuse detection systems for ancient or low-resource languages should recognize that lexical-overlap methods are often inadequate. MiqraBERT demonstrates that finetuning a BERT-based model with cosine-similarity regression can substantially improve detection, particularly for narrative texts. Consider adopting this methodology, evaluate performance separately for each genre, and try the publicly available MiqraBERT for Biblical Hebrew applications.

Key insights

Finetuning a BERT model with cosine-similarity regression detects textual reuse in Biblical Hebrew that lexical-overlap methods miss, though the reported gains hold for narrative rather than poetic parallels.

Principles

Method

Finetune a pre-trained encoder using cosine-similarity regression on a balanced set of true parallels and randomly sampled negatives to learn a semantic embedding space.
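The regression objective described above can be sketched as a mean squared error between the cosine similarity of two verse embeddings and a target label. This is a minimal illustration, not the authors' training code: the targets of 1.0 for parallels and 0.0 for negatives, the function names, and the toy 2-dimensional vectors are assumptions. In real finetuning the embeddings come from the encoder and the loss is minimized by gradient descent.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_regression_loss(pairs):
    """Mean squared error between cosine similarity and the target label.

    `pairs` is a list of (embedding_a, embedding_b, target) tuples.
    Assumed targets: 1.0 for a true parallel, 0.0 for a sampled negative.
    """
    errors = [(cosine_similarity(u, v) - target) ** 2 for u, v, target in pairs]
    return sum(errors) / len(errors)


# Toy embeddings standing in for encoder outputs (hypothetical values).
batch = [
    ([0.9, 0.1], [0.8, 0.2], 1.0),  # labeled as a true parallel
    ([0.9, 0.1], [0.1, 0.9], 0.0),  # labeled as a random negative
]

print("Batch loss:", cosine_regression_loss(batch))
```

The sentence-transformers library provides a loss of this form as `losses.CosineSimilarityLoss`; whether MiqraBERT was trained with that exact class is not stated in this summary.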

Best for: NLP Engineer, AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.