I fine-tuned Gemma 3 27B on code and got 98.78% HumanEval / 73% MBPP. Here’s the honest breakdown including all the eval bugs I hit.

2026-05-14 · Source: Machine Learning ML & Generative AI News · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, medium

Summary

Forge-Gemma-3-27B-GGUF, a QLoRA fine-tune of Google's Gemma-3-27B-IT, has been released for code generation in Python, JavaScript, Java, C++, and C. It was trained on approximately 33,000 filtered and deduplicated samples from the self-oss-instruct and CodeAlpaca datasets and reports 98.78% pass@1 on HumanEval and 73% on MBPP. That is a gain of about 14.8 percentage points on HumanEval over the base model's ~84%, while MBPP stayed roughly flat. The author stresses that debugging the evaluation harness was critical: duplicate function stubs in HumanEval and hardcoded function names in MBPP initially produced drastically low scores. The release ships the model in Q4_K_M GGUF format (~17GB), runnable on GPUs with 12GB of VRAM (partial offload) or 18GB (full offload), along with a FastAPI inference server.
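
To make the headline number concrete, here is a minimal sketch of how a pass@1 score can be computed. It assumes one sample per problem and HumanEval's 164 problems; the figure of 162 solved problems is inferred from the reported percentage, not stated by the source. The `pass_at_k` helper is the standard unbiased estimator for the multi-sample case, and both function names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: list) -> float:
    """With one sample per problem, pass@1 is the fraction of problems solved."""
    return sum(results) / len(results)


# Assumption: 162 of HumanEval's 164 problems solved (inferred, not reported).
score = benchmark_pass_at_1([True] * 162 + [False] * 2)
print(f"{score:.2%}")
```

Because the benchmark is this small, a single problem moves the score by roughly 0.6 percentage points, which is one reason harness bugs can swing results so sharply.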

Key takeaway

If you evaluate or fine-tune code generation models, debug your evaluation scripts meticulously and learn each benchmark's specifics. Initial low scores may stem from eval harness bugs rather than from the model. The HumanEval vs. MBPP gap also shows why training data should match the target problem style: a large gain on one benchmark is easy to misread as general improvement.

Key insights

Careful evaluation harness design and debugging are critical for accurate LLM benchmark results.
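
The two harness bugs named in the summary (duplicate function stubs in HumanEval, hardcoded function names in MBPP) can be guarded against roughly as follows. This is a sketch under assumptions, not the author's harness: the helper names `strip_fences`, `build_humaneval_program`, and `mbpp_entry_point` are hypothetical, and the logic assumes Python problems where the model may return either a function body or the whole function.

```python
import ast
import builtins
import re
from typing import Optional

FENCE = "`" * 3  # built programmatically so the example nests cleanly
_FENCED = re.compile(FENCE + r"[\w+-]*\n(.*?)" + FENCE, re.DOTALL)


def strip_fences(text: str) -> str:
    """Return the first fenced code block's contents, or the text unchanged."""
    match = _FENCED.search(text)
    return match.group(1) if match else text


def build_humaneval_program(prompt: str, completion: str, entry_point: str) -> str:
    """Assemble the program to test without duplicating the function stub."""
    code = strip_fences(completion)
    stub = re.compile(rf"^def\s+{re.escape(entry_point)}\s*\(", re.MULTILINE)
    if stub.search(code):
        # The model re-emitted the whole function: keep only the prompt's
        # header (imports, helpers) and drop the prompt's own stub.
        match = stub.search(prompt)
        header = prompt[: match.start()] if match else ""
        return header + code
    # The model returned only the body: the prompt supplies the signature.
    return prompt + code


def mbpp_entry_point(test_line: str) -> Optional[str]:
    """Read the expected function name from an MBPP assert instead of hardcoding it."""
    for node in ast.walk(ast.parse(test_line)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if not hasattr(builtins, node.func.id):  # skip set(), len(), ...
                return node.func.id
    return None
```

Deriving the name from the test line means the harness can tell the model which function to define, or detect a mismatch, rather than failing every problem with a `NameError`.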

Principles

Training data distribution impacts generalization.
Tokenizer compatibility is crucial for model integrity.

Method

The fine-tuning pipeline involved dataset curation, QLoRA training, LoRA merge, GGUF export, FastAPI inference server setup, and a custom eval harness for benchmarks.
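
The source does not say how the samples were deduplicated. As one plausible illustration of the curation step, the sketch below performs exact deduplication on whitespace-normalized text. The field names `instruction` and `response` are assumptions, and a real pipeline might use near-duplicate detection instead.

```python
import hashlib


def normalize(text: str) -> str:
    """Collapse runs of whitespace so trivially reformatted copies collide."""
    return " ".join(text.split())


def dedupe(samples: list) -> list:
    """Keep the first occurrence of each (instruction, response) pair."""
    seen = set()
    kept = []
    for sample in samples:
        # Hypothetical schema: each sample is a dict with these two keys.
        joined = normalize(sample["instruction"]) + "\n" + normalize(sample["response"])
        key = hashlib.sha256(joined.encode("utf-8")).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        kept.append(sample)
    return kept
```

Hashing the pair rather than the instruction alone keeps samples that share a prompt but give different solutions.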

In practice

Use llama.cpp b3447+ for Gemma 3 GGUF export.
Set `temp=0.1` and `min_p=0.05` for code generation.
Add MBPP-style data to improve algorithmic performance.
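
To show what the `temp=0.1` and `min_p=0.05` settings above do to a next-token distribution, here is a small pure-Python illustration. It is a sketch of the general technique, not the sampler of any particular inference stack: it applies temperature first and min-p second, whereas real implementations differ in the order of these steps, and the function name `min_p_filter` is made up for this example.

```python
import math


def min_p_filter(logits, temperature=0.1, min_p=0.05):
    """Return renormalized probabilities after temperature scaling and min-p."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    # min-p keeps tokens at least min_p times as likely as the top token.
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    norm = sum(kept)
    return [p / norm for p in kept]


logits = [2.0, 1.9, 0.0]
sharp = min_p_filter(logits, temperature=0.1, min_p=0.05)
flat = min_p_filter(logits, temperature=1.0, min_p=0.05)
print(sharp)  # the third token is removed entirely
print(flat)   # at temperature 1.0 all three tokens survive
```

A low temperature concentrates probability on the top candidates, and min-p then discards the long tail, which suits code generation where a single unlikely token can break syntax.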

Topics

Gemma 3 27B
QLoRA Fine-tuning
Code Generation
HumanEval Benchmark
MBPP Benchmark

Code references

thesis09/Finetuned-Google-Gemma3-27B-It-for-code-generator-or-vibe-coder

Best for: Machine Learning Engineer, AI Engineer, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.