Bytes Speak All Languages: Cross-Script Name Retrieval via Contrastive Learning
Summary
A compact transformer encoder, trained from scratch on raw UTF-8 bytes with no tokenizer and no pretrained backbone, targets a silent failure mode in sanctions screening and similar systems: cross-script name matching. Classical methods such as edit distance and Soundex break down when a query like "Владимир Путин" (Cyrillic) must match "Vladimir Putin" (Latin), because the two character sets are disjoint and romanization is ambiguous. Across 8 non-Latin scripts, the byte-level encoder reaches 0.775 MRR (mean reciprocal rank) and 0.897 R@10 (recall at 10), and it narrows the performance gap between Latin and non-Latin queries by 10x relative to classical baselines. The model has ~4M parameters and was trained with InfoNCE loss and hard negative mining on 4.67 million cross-script phonetic name pairs generated by a 4-stage LLM pipeline.
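To make the failure of edit distance concrete, here is a minimal sketch, not taken from the original article. It uses a textbook Levenshtein implementation, and the `unrelated` string is a made-up placeholder added purely for comparison.

```python
def levenshtein(a: str, b: str) -> int:
    """Textbook dynamic-programming edit distance over Unicode code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


cyrillic = "Владимир Путин"
latin = "Vladimir Putin"
unrelated = "Xxxxxxxx Yyyyy"  # hypothetical placeholder with the same word lengths

# Cyrillic and Latin letters are different code points, so only the space matches.
print(levenshtein(cyrillic, latin))
print(levenshtein(cyrillic, unrelated))
```

Both calls return the same distance, because every letter has to be substituted in either case. Edit distance therefore ranks the correct transliteration no higher than a string of placeholder letters.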
Key takeaway
NLP engineers building cross-script name matching or entity resolution systems should consider a byte-level transformer encoder. In the reported results it outperforms classical methods on non-Latin queries, narrows the script gap by 10x, and copes with ambiguous romanization. Two supporting techniques are also worth exploring: LLM-powered data generation pipelines for building large-scale training sets in low-resource languages, and ANCE hard negative mining for separating phonetically similar names.
Key insights
A byte-level transformer encoder handles cross-script phonetic name retrieval by learning to map the byte sequences of any script into a shared embedding space.
Principles
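- Byte-level tokenization eliminates out-of-vocabulary (OOV) tokens in multilingual tasks: every string maps onto the same 256 byte values.
- Romanization is not a deterministic function: one name can have several valid Latin spellings.
- Names lack the semantic context that dense retrieval usually relies on.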
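The first principle can be illustrated with a few lines of Python. This is a sketch of the general idea only; the helper name `to_byte_ids` is hypothetical, and the article's actual preprocessing (padding, special tokens, maximum length) is not described here.

```python
def to_byte_ids(name: str) -> list[int]:
    """Encode a name as its UTF-8 byte values; the vocabulary is always 0..255."""
    return list(name.encode("utf-8"))


for name in ["Vladimir Putin", "Владимир Путин"]:
    ids = to_byte_ids(name)
    # Every ID is below 256 regardless of script, so no token can be out of vocabulary.
    print(name, len(ids), max(ids) < 256)
    # The mapping is lossless: decoding the bytes recovers the original string.
    assert bytes(ids).decode("utf-8") == name
```

The Cyrillic form produces a longer sequence than the Latin one because each Cyrillic letter takes two bytes in UTF-8, which is a cost byte-level models accept in exchange for a fixed, script-independent vocabulary.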
Method
A 4-stage LLM pipeline generates 4.67 million cross-script phonetic name pairs. A 6-layer byte-level transformer encoder is trained from scratch using InfoNCE loss and ANCE hard negative mining.
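As a rough illustration of the training objective, the sketch below computes an in-batch InfoNCE loss in plain Python. It is not the article's implementation: the temperature of 0.07, the cosine-similarity scoring, and the function names are assumptions, and a real training loop would use a tensor library with automatic differentiation.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(queries, keys, temperature=0.07):
    """In-batch InfoNCE: queries[i] pairs with keys[i]; other keys act as negatives.

    The temperature value is an assumed default, not a figure from the article.
    """
    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]  # -log softmax of the positive pair
    return total / len(q)


# Toy embeddings: each query points roughly the same way as its own key.
toy_queries = [[1.0, 0.0], [0.0, 1.0]]
toy_keys = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = info_nce(toy_queries, toy_keys)
shuffled_loss = info_nce(toy_queries, toy_keys[::-1])
print(aligned_loss < shuffled_loss)
```

The loss is low when each query is closest to its own key and high when the pairing is shuffled, which is the pressure that pulls the two spellings of one name together in embedding space.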
In practice
- Use byte-level tokenization for surface-form matching tasks.
- Employ LLMs for synthetic data generation in low-resource scenarios.
- Implement ANCE hard negative mining to sharpen embedding spaces.
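The last point can be sketched as follows. ANCE retrieves negatives from the whole corpus with an approximate nearest neighbor index that is refreshed from recent model checkpoints; the version below swaps in a brute-force cosine search over a toy corpus so it stays self-contained. All names and vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mine_hard_negatives(query_vecs, corpus_vecs, positive_ids, k=2):
    """For each query, return the k corpus indices that the current embeddings
    score highest, excluding that query's known positives.

    Brute-force stand-in for the ANN index that ANCE uses.
    """
    mined = []
    for q, positives in zip(query_vecs, positive_ids):
        scored = [(cosine(q, c), idx)
                  for idx, c in enumerate(corpus_vecs)
                  if idx not in positives]
        scored.sort(reverse=True)  # most similar (hardest) first
        mined.append([idx for _, idx in scored[:k]])
    return mined


# Toy embeddings: corpus item 0 is the true match, item 1 is a near miss.
toy_corpus = [[1.0, 0.0], [0.9, 0.3], [0.0, 1.0], [-1.0, 0.0]]
toy_query = [[1.0, 0.05]]
hard_negatives = mine_hard_negatives(toy_query, toy_corpus, positive_ids=[{0}], k=2)
print(hard_negatives)
```

In a training loop, the mined indices would be re-computed periodically with the latest encoder and fed into the contrastive loss as extra negatives, so the model keeps being challenged by the names it currently confuses.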
Topics
- Cross-Script Name Retrieval
- Contrastive Learning
- Byte-Level Encoder
- LLM Data Generation
- Hard Negative Mining
Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.