No Resource, No Benchmarks, No Problem? Evaluating and Improving LLMs for Code Generation in No-Resource Languages

2024-07-18 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

This study examines how well Large Language Models (LLMs) generate code for "no-resource languages" such as Gleam and MoonBit, for which too little training data exists. The researchers built and released versions of three code generation benchmarks (HumanEval, MBPP, McEval-Hard) translated into these languages. Initial evaluations of four LLMs (GPT-4o, o3-mini, Qwen 2.5 Coder 32B Instruct, Qwen 3 32B Instruct) showed extremely low pass@1 scores, often 0-1% on complex tasks, with syntactic errors as the primary cause. The most effective technique was further pre-training a base LLM on the limited no-resource language data available, which raised pass@1 to as much as 26% on McEval-Hard for MoonBit. Because further pre-training a base model does not give it instruction-following capabilities, the authors introduce an instruction transferring method that injects instruction weights from an instruct model into the pre-trained base model. This approach achieved pass@1 scores of over 25% for Gleam and 33% for MoonBit on McEval-Hard, enabling cost-effective deployment of specialized coding assistants.
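The summary reports pass@1 without saying how it was computed. As a point of reference, here is a minimal sketch of the unbiased pass@k estimator commonly used with HumanEval-style benchmarks, assuming n samples are generated per task and c of them pass the tests. The function names and the toy counts are illustrative assumptions, not taken from the paper.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k for one task.

    n: samples generated for the task
    c: samples that passed all tests
    k: number of attempts the metric allows
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset contains at least one passing sample.
        return 1.0
    # 1 - P(all k drawn samples fail)
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: Iterable[Tuple[int, int]], k: int = 1) -> float:
    """Average the per-task estimate over (n, c) pairs, one pair per task."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return sum(scores) / len(scores)


# Hypothetical counts for four tasks, 10 samples each (not real results).
toy_results = [(10, 3), (10, 0), (10, 10), (10, 1)]
score = benchmark_pass_at_k(toy_results, k=1)
print(f"pass@1 = {score:.2f}")
```

For k = 1 the estimator reduces to c / n per task, so a benchmark-level pass@1 is simply the mean fraction of passing samples across tasks.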

Key takeaway

AI engineers building code generation solutions for proprietary or no-resource languages should prioritize the instruction transferring technique. It lets you further pre-train a base model on the limited language data available and then cheaply inject instruction-following capabilities, achieving significantly higher performance (e.g., 33% pass@1 on McEval-Hard for MoonBit) than traditional fine-tuning or in-context learning. The result is cost-effective deployment of specialized coding assistants, even with smaller models.

Key insights

Instruction transferring on pre-trained base models effectively specializes LLMs for no-resource languages without costly instruction fine-tuning.

Principles

LLM code generation performance strongly depends on language resource availability.
Further pre-training base models significantly boosts no-resource language proficiency.
Weight diff transfer efficiently injects instruction-following into specialized base models.

Method

Pre-train a base LLM on target no-resource language data, then transfer instruction-following capabilities by adding the weight difference from an instruction-tuned model.
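The weight-difference step can be sketched as plain parameter arithmetic. This is a minimal illustration, not the authors' implementation: it assumes the difference is taken between an instruction-tuned model and the original base model it was tuned from, and that all three checkpoints share the same parameter names and shapes. The function name, the `scale` knob, and the toy values are hypothetical.

```python
from typing import Dict, TypeVar

# Any value supporting elementwise + and -, e.g. a float, a NumPy array,
# or a torch tensor taken from a model's state dict.
Param = TypeVar("Param")


def transfer_instruction_weights(
    base: Dict[str, Param],
    instruct: Dict[str, Param],
    specialized: Dict[str, Param],
    scale: float = 1.0,  # assumption: optional scaling of the difference
) -> Dict[str, Param]:
    """Return specialized + scale * (instruct - base), parameter by parameter.

    base:        original base checkpoint
    instruct:    instruction-tuned checkpoint derived from `base`
    specialized: `base` after further pre-training on the target language
    """
    if not (base.keys() == instruct.keys() == specialized.keys()):
        raise ValueError("all three checkpoints must share the same parameter names")
    return {
        name: specialized[name] + scale * (instruct[name] - base[name])
        for name in base
    }


# Toy checkpoints with scalar "parameters" (hypothetical values).
base = {"w": 1.0, "b": 0.5}
instruct = {"w": 1.5, "b": 0.25}
specialized = {"w": 2.0, "b": 0.5}

merged = transfer_instruction_weights(base, instruct, specialized)
print(merged)
```

Because the function only relies on `+` and `-`, the same code would apply to dictionaries of tensors, provided the instruct and base checkpoints come from the same architecture. No additional instruction-tuning run is needed, which is where the cost saving described above comes from.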

In practice

Deploy in-house LLM code assistants for proprietary languages.
Apply weight diff transfer to specialize open-source base models.
Prioritize pre-training over fine-tuning for limited language data.

Topics

Code Generation
Large Language Models
No-Resource Languages
Instruction Transferring
Model Specialization
Performance Benchmarking

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.