Bytes Speak All Languages: Cross-Script Name Retrieval via Contrastive Learning

2026-04-26 · Source: Towards Data Science · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, long

Summary

A compact transformer encoder, trained from scratch on raw UTF-8 bytes with no tokenizer and no pretrained backbone, targets a silent failure mode in sanctions screening and similar systems: cross-script name matching. Classical methods such as edit distance and Soundex break down when a query like "Владимир Путин" (Cyrillic) must match "Vladimir Putin" (Latin), because the two scripts share no characters and romanization is ambiguous. The byte-level encoder reaches 0.775 MRR and 0.897 R@10 across 8 non-Latin scripts and narrows the performance gap between Latin and non-Latin queries by 10x relative to classical baselines. The model has ~4M parameters and was trained with InfoNCE loss and hard negative mining on a 4.67 million-pair dataset generated by a 4-stage LLM pipeline.
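The two reported metrics, MRR (mean reciprocal rank) and R@10 (recall at 10), can be illustrated with a minimal sketch. It assumes each query has exactly one correct entity, which the summary does not state; the function names and toy data are hypothetical and are not taken from the original article or its repository.

```python
def reciprocal_rank(ranked_ids, relevant_id):
    """Return 1/rank of the relevant item, or 0.0 if it was not retrieved."""
    for rank, candidate in enumerate(ranked_ids, start=1):
        if candidate == relevant_id:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(results):
    """results: list of (ranked_ids, relevant_id) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in results) / len(results)


def recall_at_k(results, k=10):
    """Fraction of queries whose relevant item appears in the top k results."""
    hits = sum(1 for ranked, rel in results if rel in ranked[:k])
    return hits / len(results)


# Toy data (assumption: one relevant entity per query).
toy_results = [
    (["e1", "e2", "e3"], "e1"),  # relevant item at rank 1
    (["e5", "e4", "e6"], "e4"),  # relevant item at rank 2
    (["e7", "e8", "e9"], "e0"),  # relevant item not retrieved
]
toy_mrr = mean_reciprocal_rank(toy_results)
toy_recall_at_2 = recall_at_k(toy_results, k=2)
```

MRR rewards placing the correct name near the top of the ranking, while R@10 only asks whether it appears anywhere in the first ten candidates, which is why the two numbers can differ noticeably.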

Key takeaway

NLP engineers building cross-script name matching or entity resolution systems should consider a byte-level transformer encoder. In the reported results it significantly outperforms classical methods on non-Latin queries, narrows the gap between Latin and non-Latin queries by 10x, and copes with challenges such as ambiguous romanization. Two supporting techniques are also worth adopting: LLM-powered data generation pipelines to build large-scale training datasets for low-resource languages, and ANCE hard negative mining to improve discrimination between phonetically similar names.

Key insights

A byte-level transformer encoder handles cross-script phonetic name retrieval by learning mappings between byte sequences that hold across scripts.

Principles

Byte-level tokenization eliminates out-of-vocabulary (OOV) tokens for multilingual tasks.
Romanization is not a deterministic function: one name can have several valid Latin spellings.
Names lack semantic context for dense retrieval.
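The first principle can be made concrete with a short sketch. It shows that the Cyrillic and Latin spellings from the summary share no letters, yet both reduce to integers in the fixed range 0 to 255 once encoded as UTF-8. The helper name `to_byte_ids` is hypothetical, and the original model's exact input handling (padding, special tokens) is not described in the summary.

```python
def to_byte_ids(name):
    """Encode a string as a list of UTF-8 byte values (each in 0..255)."""
    return list(name.encode("utf-8"))


latin = "Vladimir Putin"
cyrillic = "Владимир Путин"

# At the character level, the two spellings overlap only in the space.
shared_chars = set(latin) & set(cyrillic)

latin_ids = to_byte_ids(latin)        # 1 byte per ASCII character
cyrillic_ids = to_byte_ids(cyrillic)  # 2 bytes per Cyrillic letter

# Every possible input maps into the same 256-symbol vocabulary,
# so no script can ever produce an out-of-vocabulary token.
all_in_vocab = all(0 <= b < 256 for b in latin_ids + cyrillic_ids)
```

Character-level methods such as edit distance see two strings with almost nothing in common here, whereas a byte-level model receives both as sequences over one shared vocabulary and can learn the correspondence.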

Method

A 4-stage LLM pipeline generates 4.67 million cross-script phonetic name pairs. A 6-layer byte-level transformer encoder is trained from scratch using InfoNCE loss and ANCE hard negative mining.
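As a rough illustration of the InfoNCE objective named above, here is a pure-Python sketch of the in-batch variant, where each query's paired name is the positive and every other name in the batch acts as a negative. The temperature of 0.07 and the use of cosine similarity are assumptions for illustration; the summary does not give the original hyperparameters or implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(queries, positives, temperature=0.07):
    """In-batch InfoNCE: positives[i] matches queries[i]; other rows are negatives.

    The temperature value is an assumed default, not taken from the source.
    """
    q = [l2_normalize(v) for v in queries]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the true pair as the target class.
        total += log_denominator - logits[i]
    return total / len(q)


# Toy embeddings: correctly paired rows versus swapped rows.
toy_queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(toy_queries, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce(toy_queries, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each query is closest to its own pair and grows large when the pairs are swapped, which is the pressure that pulls cross-script spellings of the same name together in embedding space.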

In practice

Use byte-level tokenization for surface-form matching tasks.
Employ LLMs for synthetic data generation in low-resource scenarios.
Implement ANCE hard negative mining to sharpen embedding spaces.
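The hard negative mining step can be sketched as follows. ANCE retrieves negatives with an approximate nearest-neighbor index built from the current model's embeddings and refreshes that index periodically during training; this sketch swaps in brute-force search over a tiny corpus for clarity. The function name, data layout, and `k` value are hypothetical, not the original implementation.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mine_hard_negatives(query_vecs, corpus_vecs, positive_idx, k=2):
    """For each query, return the k highest-scoring corpus indices
    that are not the query's own positive.

    Brute-force search stands in for the ANN index that ANCE would use.
    """
    mined = []
    for qi, q in enumerate(query_vecs):
        ranked = sorted(
            range(len(corpus_vecs)),
            key=lambda j: dot(q, corpus_vecs[j]),
            reverse=True,
        )
        mined.append([j for j in ranked if j != positive_idx[qi]][:k])
    return mined


# Toy embeddings from a hypothetical current checkpoint.
toy_corpus = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
toy_queries = [[1.0, 0.0]]
toy_positive_idx = [0]  # corpus item 0 is the true match for query 0

hard_negatives = mine_hard_negatives(toy_queries, toy_corpus, toy_positive_idx, k=2)
```

Because the negatives are chosen by the model's own current similarity scores, they are the names it is most likely to confuse with the true match, so re-mining after each checkpoint keeps the training signal focused on the hardest cases.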

Topics

Cross-Script Name Retrieval
Contrastive Learning
Byte-Level Encoder
LLM Data Generation
Hard Negative Mining

Code references

vedant-jumle/cross-language-phonetic-text-alignment

Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.