Training Azerbaijani language models on Amazon SageMaker AI
Summary
Azercell Telecom LLC, in collaboration with the AWS Generative AI Innovation Center, developed a production-ready framework on Amazon SageMaker AI for training Azerbaijani large language models. The six-week project addressed the challenge of adapting foundation models to a morphologically rich, low-resource language. On an ml.p5.48xlarge instance, the solution achieved 23% higher training throughput and 58% lower peak GPU memory usage. A custom monolingual tokenizer roughly halved the encoding cost, from 3.22 to 1.59 tokens per word, so about twice as much Azerbaijani text fits in the same context window. The framework also used continued pre-training of Llama 3.2 1B with PyTorch FSDP and Liger Kernel optimizations, reducing per-GPU memory from 9.23 GB to 1.17 GB. Supervised fine-tuning with LoRA then turned the model into a coherent conversational assistant.
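The tokenizer figures can be checked with simple arithmetic. The sketch below is illustrative only: `tokens_per_word`, `words_in_context`, and the 8,192-token budget are assumptions introduced here, not details from the original article. It shows how a tokens-per-word ratio might be measured and how the reported 3.22 and 1.59 figures translate into words per context window.

```python
def tokens_per_word(texts, tokenize):
    """Average number of tokens produced per whitespace-separated word.

    `tokenize` is any callable mapping a string to a list of tokens
    (a hypothetical stand-in for a real tokenizer).
    """
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("no words in input")
    return total_tokens / total_words


def words_in_context(context_tokens, tpw):
    """Approximate number of words that fit in a fixed token budget."""
    return int(context_tokens / tpw)


# Figures reported in the summary.
baseline_tpw = 3.22
custom_tpw = 1.59

# Hypothetical token budget, chosen only for illustration.
context_tokens = 8192

baseline_words = words_in_context(context_tokens, baseline_tpw)
custom_words = words_in_context(context_tokens, custom_tpw)
gain = baseline_tpw / custom_tpw  # ratio of words per token budget
```

Under these assumptions the gain is just the ratio of the two tokens-per-word figures, which is why it does not depend on the chosen token budget.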
Key takeaway
NLP engineers developing LLMs for low-resource or morphologically rich languages should prioritize custom tokenizer development, which in this project roughly doubled the amount of text that fits in the context window. Combining PyTorch FSDP with Liger Kernels on Amazon SageMaker AI delivered significant GPU memory savings and 23% higher training throughput. Together, these steps enable efficient adaptation of foundation models and scalable deployment for conversational AI applications.
Key insights
Optimizing LLM training for low-resource, morphologically rich languages requires custom tokenization and GPU memory optimizations.
Principles
- Custom tokenizers significantly improve encoding efficiency for complex languages.
- FSDP and kernel optimizations reduce GPU memory and boost throughput.
- Two-phase continued pre-training (CPT) adapts the embeddings first, then trains the full model.
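To give a feel for why sharding reduces per-GPU memory, here is a rough back-of-the-envelope estimate. It assumes mixed-precision Adam training at 16 bytes per parameter (2 for weights, 2 for gradients, 12 for optimizer state) and full sharding across the 8 GPUs of an ml.p5.48xlarge. The function name, the parameter count, and the byte budget are illustrative assumptions, and the estimate ignores activations, so it is not meant to reproduce the memory figures reported in the article.

```python
def model_state_gib(num_params, num_gpus=1, bytes_per_param=16):
    """Per-GPU memory in GiB for weights, gradients, and optimizer state.

    Assumes the model state is split evenly across `num_gpus`
    (full sharding) and ignores activations and buffers.
    """
    return num_params * bytes_per_param / num_gpus / 2**30


# Round, illustrative parameter count (not the exact size of any model).
params = 1_000_000_000

unsharded = model_state_gib(params)            # everything on one GPU
sharded = model_state_gib(params, num_gpus=8)  # split across 8 GPUs
```

Activation memory is outside this estimate; that is the part fused kernels are meant to reduce.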
Method
Develop a custom Byte-Level BPE (BBPE) tokenizer, then perform two-phase continued pre-training (embedding adaptation followed by full-model training) with FSDP and Liger Kernels, followed by LoRA-based supervised fine-tuning.
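The two-phase schedule can be sketched as a rule for which parameters are trainable in each phase. This is a minimal illustration, not the project's code: it assumes that phase 1 freezes everything except the embedding-related weights, and that parameter names follow the Hugging Face Llama convention (`embed_tokens`, `lm_head`). The helper names are hypothetical.

```python
# Assumed name fragments for embedding-related weights (Hugging Face Llama style).
EMBEDDING_KEYS = ("embed_tokens", "lm_head")


def trainable_in_phase(param_name, phase):
    """Return True if the named parameter should receive gradients in this phase."""
    if phase == 1:
        # Phase 1: adapt only the embeddings to the new tokenizer.
        return any(key in param_name for key in EMBEDDING_KEYS)
    if phase == 2:
        # Phase 2: train the full model.
        return True
    raise ValueError(f"unknown phase: {phase}")


def plan(param_names, phase):
    """Map each parameter name to its trainable flag for the given phase."""
    return {name: trainable_in_phase(name, phase) for name in param_names}


# Example parameter names, for illustration only.
names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.mlp.up_proj.weight",
    "lm_head.weight",
]

phase1 = plan(names, 1)
phase2 = plan(names, 2)
```

With a PyTorch model, the same predicate could be used to set `param.requires_grad` while iterating over `model.named_parameters()`.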
In practice
- Train custom Byte-Level BPE tokenizers for morphologically rich languages.
- Use PyTorch FSDP to shard model state across GPUs and reduce per-GPU memory in distributed training.
- Integrate Liger Kernels to fuse operations and reduce GPU memory.
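As an illustration of the first item, the toy sketch below learns byte-level BPE merges in pure Python. It is a simplified teaching example, not the tokenizer built in the project: it splits on whitespace, uses a four-word sample corpus invented here, and breaks ties deterministically. A production tokenizer would normally come from an established library.

```python
from collections import Counter


def to_byte_symbols(word):
    """Split a word into a tuple of single-byte tokens (UTF-8)."""
    data = word.encode("utf-8")
    return tuple(data[i:i + 1] for i in range(len(data)))


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged token."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bbpe(corpus, num_merges):
    """Learn up to `num_merges` merges, most frequent adjacent pair first."""
    words = Counter(to_byte_symbols(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break  # every word is already a single token
        # Highest count wins; ties are broken by byte order for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[apply_merge(symbols, best)] += freq
        words = new_words
    return merges


def encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = to_byte_symbols(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


# Tiny sample corpus of related word forms, for illustration only.
corpus = "kitab kitablar kitabları kitablarımız"
merges = train_bbpe(corpus, 10)
tokens = encode("kitablar", merges)
```

Because the base vocabulary is raw bytes, any word can be encoded without an unknown token, and joining the tokens returns the original UTF-8 bytes.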
Topics
- Azerbaijani Language Models
- Amazon SageMaker AI
- Custom Tokenization
- Low-Resource Languages
- PyTorch FSDP
- Liger Kernels
- LoRA Fine-tuning
Best for: Machine Learning Engineer, NLP Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.