Tokenizer Fertility and Zero-Shot Performance of Foundation Models on Ukrainian Legal Text: A Comparative Study

· Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

A comparative study benchmarked seven foundation models from five providers on 273 Ukrainian court decisions from the EDRSR (Ukraine's Unified State Register of Court Decisions), measuring tokenizer fertility and zero-shot performance across three tasks. Tokenizer fertility varied by 1.6x: Qwen3 models consumed 60% more tokens than Llama-family models for the same input, which directly raises API costs. NVIDIA Nemotron Super 3 (120B) achieved the highest composite score, 83.1, ahead of Mistral Large 3 (675B total, 41B active), even though Mistral Large 3 has far more parameters and costs three times as much via API. Few-shot prompting consistently degraded performance, by up to 26 percentage points, a finding that ablations with Ukrainian-language demonstrations confirmed.
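The link between fertility and cost can be made concrete with a short sketch. It assumes the common definition of fertility as tokens per whitespace-delimited word, which the summary does not spell out. The function names, the toy chunking tokenizer, and the sample sentence are all hypothetical stand-ins, not the study's code or any provider's tokenizer.

```python
from typing import Callable, List


def fertility(texts: List[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average tokens per whitespace-delimited word (assumed definition)."""
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    if total_words == 0:
        raise ValueError("input contains no words")
    return total_tokens / total_words


def relative_cost(fertility_a: float, fertility_b: float,
                  price_a: float = 1.0, price_b: float = 1.0) -> float:
    """Input cost of model A relative to model B for the same text.

    Cost scales with tokens consumed times price per token, so equal
    prices leave the fertility ratio as the cost ratio.
    """
    return (fertility_a * price_a) / (fertility_b * price_b)


def chunk_tokenizer(size: int) -> Callable[[str], List[str]]:
    """Hypothetical tokenizer: split each word into fixed-size pieces."""
    def tokenize(text: str) -> List[str]:
        return [w[i:i + size] for w in text.split()
                for i in range(0, len(w), size)]
    return tokenize


# Illustrative sentence only, not taken from the benchmark corpus.
sample = ["суд ухвалив рішення"]
coarse = fertility(sample, chunk_tokenizer(4))
fine = fertility(sample, chunk_tokenizer(2))
print(f"coarse={coarse:.2f} fine={fine:.2f} "
      f"cost ratio={relative_cost(fine, coarse):.2f}")
```

In practice the `tokenize` argument would wrap each model's real tokenizer, and `price_a` and `price_b` would carry per-token API prices, so a model that is cheaper per token can still cost more per document if its fertility is high enough.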

Key takeaway

AI/ML teams evaluating foundation models for Ukrainian legal text processing should make tokenizer efficiency part of model selection, because it directly drives API costs. They should also default to zero-shot prompting: contrary to common intuition, few-shot examples can significantly degrade performance on a morphologically rich language like Ukrainian, by up to 26 percentage points in this study. Together, these two choices help control cost without sacrificing accuracy.

Key insights

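Tokenizer fertility varied by 1.6x across the seven models, and zero-shot scores did not track parameter count or price: the 120B NVIDIA Nemotron Super 3 outscored the much larger and more expensive Mistral Large 3.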

Method

The study benchmarks seven models on 273 Ukrainian court decisions, measuring tokenizer fertility and zero-shot performance on three tasks, and then runs stratified and prompt-sensitivity ablations.
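A prompt-sensitivity ablation of the kind mentioned above might look like the following minimal sketch. It is not the study's harness: the prompt layout, the exact-match metric, and every name here (`build_prompt`, `accuracy`, `few_shot_delta`, the `model` callable) are assumptions chosen for illustration, and the task is a hypothetical single-label classification.

```python
from typing import Callable, Optional, Sequence, Tuple

Example = Tuple[str, str]  # (input text, gold label)


def build_prompt(instruction: str, text: str,
                 demos: Optional[Sequence[Example]] = None) -> str:
    """Build a zero-shot prompt, or a few-shot one when demos are given."""
    parts = [instruction]
    for demo_text, demo_label in demos or []:
        parts.append(f"Text: {demo_text}\nAnswer: {demo_label}")
    parts.append(f"Text: {text}\nAnswer:")
    return "\n\n".join(parts)


def accuracy(model: Callable[[str], str], instruction: str,
             data: Sequence[Example],
             demos: Optional[Sequence[Example]] = None) -> float:
    """Exact-match accuracy in percent (assumed metric)."""
    if not data:
        raise ValueError("evaluation set is empty")
    correct = sum(
        model(build_prompt(instruction, text, demos)).strip() == gold
        for text, gold in data
    )
    return 100.0 * correct / len(data)


def few_shot_delta(model: Callable[[str], str], instruction: str,
                   data: Sequence[Example],
                   demos: Sequence[Example]) -> float:
    """Few-shot minus zero-shot accuracy, in percentage points."""
    return (accuracy(model, instruction, data, demos)
            - accuracy(model, instruction, data))
```

A negative return value from `few_shot_delta` means the demonstrations hurt. Here `model` would wrap an API call that maps a prompt string to the model's text reply, and `demos` should be drawn from outside the evaluation set to avoid leakage.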

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.