No Resource, No Benchmarks, No Problem? Evaluating and Improving LLMs for Code Generation in No-Resource Languages
Summary
This study examines how well Large Language Models (LLMs) generate code for "no-resource languages" such as Gleam and MoonBit, for which too little training data exists. The researchers translated three code generation benchmarks (HumanEval, MBPP, McEval-Hard) into these languages and released them. Initial evaluations of four LLMs (GPT-4o, o3-mini, Qwen 2.5 Coder 32B Instruct, Qwen 3 32B Instruct) showed extremely low pass@1 scores, often 0-1% on complex tasks, with syntactic errors as the primary cause of failure. The most effective technique was further pre-training a base LLM on the limited no-resource language data available, which raised pass@1 to as much as 26% on McEval-Hard for MoonBit. Because further pre-training alone does not give the base model instruction-following capabilities, the authors introduce an instruction transferring method that injects the instruction weights of an instruct model into the pre-trained base model. This approach achieved pass@1 scores of over 25% for Gleam and 33% for MoonBit on McEval-Hard.
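The summary does not say how pass@1 was computed. As a point of reference, the following is a minimal sketch of the widely used unbiased pass@k estimator, assuming `n` sampled completions per task of which `c` pass the unit tests. The function name and argument names are illustrative, not taken from the paper. For `k = 1` the estimator reduces to the fraction of samples that pass.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task: 1 - C(n - c, k) / C(n, k).

    n: number of completions sampled for the task
    c: number of those completions that pass all tests
    k: number of attempts the metric allows
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every draw of k contains a passing one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks, given (n, c) pairs per task."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

A benchmark-level score is then the mean over tasks, for example `mean_pass_at_k([(10, 3), (10, 0)], k=1)` for two tasks with ten samples each.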
Key takeaway
AI engineers building code generation solutions for proprietary or no-resource languages should prioritize the instruction transferring technique: further pre-train a base model on the limited language data available, then cheaply inject instruction-following capabilities. This yields significantly higher performance (e.g., 33% pass@1 on McEval-Hard for MoonBit) than traditional fine-tuning or in-context learning, and makes specialized coding assistants cost-effective to deploy, even with smaller models.
Key insights
Instruction transferring on pre-trained base models effectively specializes LLMs for no-resource languages without costly instruction fine-tuning.
Principles
- LLM code generation performance strongly depends on language resource availability.
- Further pre-training base models significantly boosts no-resource language proficiency.
- Weight diff transfer efficiently injects instruction-following into specialized base models.
Method
Pre-train a base LLM on target no-resource language data, then transfer instruction-following capabilities by adding the weight difference from an instruction-tuned model.
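The weight-difference step can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes the difference is taken between an instruction-tuned checkpoint and the original base checkpoint it was tuned from, and that all three checkpoints share the same parameter names and shapes. Checkpoints are modelled here as plain dictionaries of flat float lists, and the function name `transfer_instruction_delta` is hypothetical.

```python
from typing import Dict, List

# Hypothetical stand-in for a checkpoint: parameter name -> flat list of weights.
Weights = Dict[str, List[float]]


def transfer_instruction_delta(
    base: Weights, instruct: Weights, pretrained: Weights
) -> Weights:
    """Return pretrained + (instruct - base), parameter by parameter.

    base:       original base model weights
    instruct:   instruction-tuned weights derived from `base` (assumption)
    pretrained: `base` after further pre-training on the target language
    """
    if not (base.keys() == instruct.keys() == pretrained.keys()):
        raise ValueError("all three checkpoints must share parameter names")
    merged: Weights = {}
    for name in base:
        b, i, p = base[name], instruct[name], pretrained[name]
        if not (len(b) == len(i) == len(p)):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # The delta (i - b) is what instruction tuning changed; add it on top
        # of the language-specialized weights.
        merged[name] = [pw + (iw - bw) for bw, iw, pw in zip(b, i, p)]
    return merged


# Toy usage with a single two-element parameter.
merged = transfer_instruction_delta(
    base={"w": [1.0, 2.0]},
    instruct={"w": [1.5, 2.0]},
    pretrained={"w": [3.0, 4.0]},
)
```

With real models the same arithmetic would run over each tensor in the checkpoints' state dictionaries. Since it needs only element-wise addition and subtraction, no instruction fine-tuning run on the target language is required.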
In practice
- Deploy in-house LLM code assistants for proprietary languages.
- Apply weight diff transfer to specialize open-source base models.
- Prioritize pre-training over fine-tuning for limited language data.
Topics
- Code Generation
- Large Language Models
- No-Resource Languages
- Instruction Transferring
- Model Specialization
- Performance Benchmarking
Code references
- tree-sitter/tree-sitter
- gleam-lang/tree-sitter-gleam
- moonbitlang/tree-sitter-moonbit
- huggingface/open-r1
- features/copilot
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.