Training Azerbaijani language models on Amazon SageMaker AI

2026-05-28 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Cloud Computing & IT Infrastructure · Depth: Advanced, long

Summary

Azercell Telecom LLC, in collaboration with the AWS Generative AI Innovation Center, developed a production-ready framework on Amazon SageMaker AI for training Azerbaijani large language models. The six-week project addressed the challenge of adapting foundation models to a morphologically rich, low-resource language. The solution achieved 23% higher training throughput and 58% lower peak GPU memory usage on an ml.p5.48xlarge instance. Its key components include a custom monolingual tokenizer that cut the encoding cost from 3.22 to 1.59 tokens per word, so roughly twice as much Azerbaijani text fits in the same context window. The framework also used continued pre-training of Llama 3.2 1B with PyTorch FSDP and Liger Kernel optimizations, reducing per-GPU memory from 9.23 GB to 1.17 GB. Supervised fine-tuning with LoRA then turned the model into a coherent conversational assistant.
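
The context-window claim follows from simple arithmetic on the reported tokens-per-word figures. The sketch below works it through. The 8,192-token context budget and the helper name `words_that_fit` are assumptions chosen for illustration, not figures or code from the article.

```python
# Reported average tokens per word (from the summary).
BASE_TOKENS_PER_WORD = 3.22    # original tokenizer
CUSTOM_TOKENS_PER_WORD = 1.59  # custom monolingual tokenizer

# Assumption for illustration only: a fixed context budget in tokens.
CONTEXT_TOKENS = 8192


def words_that_fit(context_tokens: int, tokens_per_word: float) -> int:
    """Approximate number of words that fit in a given token budget."""
    return int(context_tokens / tokens_per_word)


# How many times fewer tokens the custom tokenizer needs per word.
efficiency_gain = BASE_TOKENS_PER_WORD / CUSTOM_TOKENS_PER_WORD

base_words = words_that_fit(CONTEXT_TOKENS, BASE_TOKENS_PER_WORD)
custom_words = words_that_fit(CONTEXT_TOKENS, CUSTOM_TOKENS_PER_WORD)

print(f"Efficiency gain: {efficiency_gain:.2f}x")
print(f"Words per {CONTEXT_TOKENS}-token context: {base_words} vs {custom_words}")
```

The ratio is slightly above 2, which is why the same context budget holds about twice as many Azerbaijani words after the tokenizer change.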

Key takeaway

NLP engineers building LLMs for low-resource or morphologically rich languages should prioritize a custom tokenizer, which in this project roughly doubled the amount of text that fits in the context window. Combining PyTorch FSDP with Liger Kernels on Amazon SageMaker AI delivered significant GPU memory savings and the reported 23% higher training throughput. Together, these steps make it practical to adapt foundation models efficiently and deploy them at scale for conversational AI applications.

Key insights

Optimizing LLM training for low-resource, morphologically rich languages requires custom tokenization and GPU memory optimizations.

Principles

Custom tokenizers significantly improve encoding efficiency for morphologically rich languages.
FSDP and kernel optimizations reduce GPU memory and boost throughput.
Two-phase continued pre-training (CPT) first adapts the embeddings, then trains the full model.

Method

Develop a custom Byte-Level BPE (BBPE) tokenizer, then run two-phase continued pre-training (embedding adaptation followed by full-model training) with FSDP and Liger Kernels, and finish with LoRA-based supervised fine-tuning.
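
The two-phase schedule can be sketched as a switch over which parameters receive gradients. This is a minimal illustration, not the project's training code: it assumes that phase 1 freezes everything except embedding-related weights, and that those weights can be identified by the name fragments `embed_tokens` and `lm_head` (as in Hugging Face Llama checkpoints). The function name `set_training_phase` and the stand-in parameters are hypothetical.

```python
from types import SimpleNamespace

# Assumption: embedding-related parameters are identified by these name
# fragments, as in Hugging Face Llama checkpoints.
EMBEDDING_KEYS = ("embed_tokens", "lm_head")


def set_training_phase(named_params, phase):
    """Toggle requires_grad for a phase and return the trainable names.

    phase 1: only embedding-related weights are trained (embedding adaptation).
    phase 2: every weight is trained (full training).
    """
    trainable = []
    for name, param in named_params:
        is_embedding = any(key in name for key in EMBEDDING_KEYS)
        param.requires_grad = (phase == 2) or is_embedding
        if param.requires_grad:
            trainable.append(name)
    return trainable


# Stand-in parameters so the sketch runs without PyTorch. With a real model,
# pass model.named_parameters() instead.
toy_params = [
    ("model.embed_tokens.weight", SimpleNamespace(requires_grad=True)),
    ("model.layers.0.self_attn.q_proj.weight", SimpleNamespace(requires_grad=True)),
    ("model.layers.0.mlp.up_proj.weight", SimpleNamespace(requires_grad=True)),
    ("lm_head.weight", SimpleNamespace(requires_grad=True)),
]

phase1 = set_training_phase(toy_params, phase=1)
print("Phase 1 trainable:", phase1)

phase2 = set_training_phase(toy_params, phase=2)
print("Phase 2 trainable:", phase2)
```

Training only the embeddings first lets the new vocabulary settle before the rest of the network is updated. Under FSDP, freezing typically has to be configured before the model is wrapped.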

In practice

Train custom Byte-Level BPE tokenizers for morphologically rich languages.
Use PyTorch FSDP to shard model state across GPUs and cut per-GPU memory in distributed training.
Integrate Liger Kernels to fuse operations and reduce GPU memory.
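
The first item above can be illustrated with a toy byte-level BPE trainer in pure Python. It is a teaching sketch, not the project's tokenizer: the sample words, the merge count, and all function names are assumptions, and a production setup would more likely use a library such as Hugging Face `tokenizers`. It shows how merges learned on Azerbaijani text lower the tokens-per-word count for suffix-heavy words.

```python
from collections import Counter


def to_bytes(word):
    """Split a word into single-byte symbols (the byte-level starting point)."""
    return tuple(bytes([b]) for b in word.encode("utf-8"))


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges, most frequent adjacent pair first."""
    words = Counter(to_bytes(w) for line in corpus for w in line.split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Deterministic tie-break: highest count, then largest pair.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, freq in words.items():
            merged[apply_merge(symbols, best)] += freq
        words = merged
    return merges


def encode_word(word, merges):
    """Encode one word by replaying the learned merges in order."""
    symbols = to_bytes(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


def tokens_per_word(text, merges):
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    return sum(len(encode_word(w, merges)) for w in words) / len(words)


# Illustrative sample: "book" and "house" with stacked Azerbaijani suffixes.
sample = "kitab kitablar kitablarımız kitablarımızdan ev evlər evlərimiz evlərimizdən"
merges = train_bpe([sample], num_merges=20)

before = tokens_per_word(sample, [])
after = tokens_per_word(sample, merges)
print(f"Tokens per word: {before:.2f} before merges, {after:.2f} after")
```

Because the merges are learned from Azerbaijani text alone, frequent stems and suffixes collapse into single tokens, which is the mechanism behind the lower tokens-per-word figure.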

Topics

Azerbaijani Language Models
Amazon SageMaker AI
Custom Tokenization
Low-Resource Languages
PyTorch FSDP
Liger Kernels
LoRA Fine-tuning

Code references

linkedin/Liger-Kernel

Best for: Machine Learning Engineer, NLP Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.