I fine-tuned Gemma 3 27B on code and got 98.78% HumanEval / 73% MBPP. Here’s the honest breakdown including all the eval bugs I hit.
Summary
Forge-Gemma-3-27B-GGUF is a QLoRA fine-tune of Google's Gemma-3-27B-IT for code generation in Python, JavaScript, Java, C++, and C. It was trained on approximately 33,000 filtered and deduplicated samples from the self-oss-instruct and CodeAlpaca datasets, and it reaches 98.78% pass@1 on HumanEval and 73% on MBPP. The HumanEval score is about 14.8 percentage points above the base model's ~84%, while MBPP stayed roughly flat. The author stresses that debugging the evaluation harness was critical: duplicate function stubs in HumanEval and hardcoded function names in MBPP initially produced drastically low scores. The release ships as a Q4_K_M GGUF file (~17 GB) that runs on GPUs with 12 GB of VRAM (partial offload) or 18 GB (full offload), and it includes a FastAPI inference server.
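The summary does not say exactly how the duplicate-stub bug arose. One plausible form is a harness that always prepends the HumanEval prompt, even when the model has re-emitted the whole function. The sketch below guards against that case; `build_program` is a hypothetical helper, not the author's code, and it assumes the standard HumanEval fields `prompt` and `entry_point`.

```python
import re

def build_program(prompt: str, completion: str, entry_point: str) -> str:
    """Assemble the source that the benchmark tests will be run against."""
    # Matches a top-level definition of the function under test.
    pattern = re.compile(rf"^def\s+{re.escape(entry_point)}\s*\(", re.MULTILINE)

    if pattern.search(completion):
        # The model rewrote the full function. Keep only the part of the
        # prompt that precedes the stub (imports, helpers) so the function
        # is not defined twice.
        stub = pattern.search(prompt)
        preamble = prompt[:stub.start()] if stub else ""
        return preamble + completion

    # The model returned only a function body, so it continues the prompt.
    return prompt + completion
```

Keeping the prompt's preamble matters because many HumanEval prompts start with imports such as `from typing import List` that the completion may rely on.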
Key takeaway
If you evaluate or fine-tune code generation models, debug your evaluation scripts and learn each benchmark's format before trusting a score: low initial numbers may come from harness bugs, not from the model. Match your training data to the target problem style as well; the gap between the HumanEval gain and the flat MBPP score shows how a style mismatch limits what a single benchmark gain says about generalization.
Key insights
Careful evaluation harness design and debugging are critical for accurate LLM benchmark results.
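As an illustration of the MBPP issue, the benchmark's assert statements call a specific function name per task, so a harness that hardcodes one name fails every other task. A minimal sketch of reading the expected name from the tests follows; `expected_function_name` is a hypothetical helper, and it assumes MBPP's `test_list` field of assert strings.

```python
import ast
import builtins

# Names such as set() or len() often wrap the real call inside an assert.
BUILTIN_NAMES = set(dir(builtins))

def expected_function_name(test_list):
    """Return the first non-builtin function called in the assert strings."""
    for test in test_list:
        tree = ast.parse(test)
        for node in ast.walk(tree):
            if (
                isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id not in BUILTIN_NAMES
            ):
                return node.func.id
    return None  # no plain function call found in any test
```

The returned name can then be stated in the generation prompt, so the model's function matches what the tests call.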
Principles
- Training data distribution impacts generalization.
- Tokenizer compatibility is crucial for model integrity.
Method
The fine-tuning pipeline involved dataset curation, QLoRA training, LoRA merge, GGUF export, FastAPI inference server setup, and a custom eval harness for benchmarks.
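For the scoring step of such a harness, pass@1 with one sample per problem is simply the share of problems whose sample passes all tests. The sketch below assumes the standard 164-problem HumanEval set; on that set, the reported 98.78% is consistent with 162 problems passing. The per-problem results are illustrative, not the author's data.

```python
def pass_at_1(results):
    """Percentage of problems whose single generated sample passed all tests."""
    if not results:
        raise ValueError("no results to score")
    return 100.0 * sum(results) / len(results)

# Illustrative outcome list: 162 passing problems out of 164.
humaneval_results = [True] * 162 + [False] * 2
score = pass_at_1(humaneval_results)
print(f"{score:.2f}%")  # prints 98.78%
```

Against a base score of roughly 84%, that is a gain of about 14.8 percentage points, matching the figure in the summary.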
In practice
- Use llama.cpp b3447+ for Gemma 3 GGUF export.
- Set `temp=0.1` and `min_p=0.05` for code generation.
- Add MBPP-style data to improve algorithmic performance.
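To show what the recommended sampling settings do, here is a minimal pure-Python sketch of min-p filtering with temperature. It is an illustration and not any library's implementation: real samplers differ in the order they apply temperature and min-p, and this sketch assumes temperature is applied first. The function name `min_p_filter` is hypothetical.

```python
import math

def min_p_filter(logits, temperature=0.1, min_p=0.05):
    """Return renormalized token probabilities after temperature and min-p."""
    # Temperature-scaled softmax; subtracting the peak keeps exp() stable.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Min-p: drop tokens whose probability is below min_p times the top one.
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]

    # Renormalize so the surviving probabilities sum to 1.
    norm = sum(kept)
    return [p / norm for p in kept]
```

A low temperature sharpens the distribution toward the top token, and min-p then removes the unlikely tail, which tends to suit code, where one stray token can break syntax.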
Topics
- Gemma 3 27B
- QLoRA Fine-tuning
- Code Generation
- HumanEval Benchmark
- MBPP Benchmark
Best for: Machine Learning Engineer, AI Engineer, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.