Smaller Models, Unexpected Costs: Trade-offs in LLM Quantization for Automated Program Repair

2026-06-26 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

This study empirically evaluates Large Language Model (LLM) quantization for Automated Program Repair (APR), analyzing 13 configurations across six LLMs (6.7B to 70B parameters) on the HumanEval-Java and Defects4J benchmarks. Quantization reduces memory footprint by up to 85% but, unexpectedly, increases inference time and energy consumption, which the study attributes to suboptimal hardware utilization. Quantized models match or exceed the base model's repair effectiveness in 11 of 12 cases, yet they often fix different sets of bugs, which indicates a shift in behavior rather than a uniform loss of capability. A Pareto trade-off analysis found that 48% of configurations were suboptimal. Overall, the trade-offs among effectiveness, memory, and energy are highly sensitive to model architecture and task complexity, and low-bit quantization often performs poorly.
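
The Pareto analysis mentioned above can be illustrated with a minimal sketch. It assumes three objectives (more plausible fixes, less memory, less energy); the configuration names and numbers are hypothetical placeholders, not results from the study, and the paper's exact objectives may differ.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str             # hypothetical label, not a configuration from the study
    plausible_fixes: int  # higher is better
    memory_gb: float      # lower is better
    energy_kj: float      # lower is better


def dominates(a: Config, b: Config) -> bool:
    """True if `a` is at least as good as `b` everywhere and strictly better somewhere."""
    no_worse = (
        a.plausible_fixes >= b.plausible_fixes
        and a.memory_gb <= b.memory_gb
        and a.energy_kj <= b.energy_kj
    )
    strictly_better = (
        a.plausible_fixes > b.plausible_fixes
        or a.memory_gb < b.memory_gb
        or a.energy_kj < b.energy_kj
    )
    return no_worse and strictly_better


def pareto_front(configs: list[Config]) -> list[Config]:
    """Keep only the configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Made-up numbers for illustration only.
configs = [
    Config("A", 50, 14.0, 100.0),
    Config("B", 50, 7.0, 120.0),
    Config("C", 45, 7.5, 130.0),  # worse than B on every objective
    Config("D", 30, 2.5, 150.0),
]

front = pareto_front(configs)
suboptimal = [c.name for c in configs if c not in front]
print([c.name for c in front], suboptimal)
```

In this toy data, a configuration is "suboptimal" in the sense used above when some other configuration is at least as good on every objective and better on one.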

Key takeaway

Machine Learning Engineers deploying LLMs for Automated Program Repair should treat quantization as a trade-off rather than a free optimization. It reduces memory footprint substantially (up to 85%), but inference time and energy consumption can increase because of suboptimal hardware utilization. Evaluate each quantization configuration, such as AWQ4(M), against your own model and task: repair effectiveness and consistency can shift, and nearly half of the configurations studied (48%) were Pareto-suboptimal.

Key insights

LLM quantization for Automated Program Repair yields memory savings but introduces complex, model-dependent trade-offs in effectiveness and efficiency.

Principles

Quantization can shift model behavior, not just reduce capability.
Memory reduction does not guarantee faster inference or lower energy.
Trade-offs are highly model- and task-dependent.
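
As a rough illustration of where the memory savings come from, the sketch below estimates weight storage from parameter count and bit width. It assumes a 16-bit baseline and counts weights only; both are simplifying assumptions, not details taken from the study. Measured footprints also include quantization metadata (scales, zero points), the KV cache, and activations, so real savings will differ from this ideal.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Idealized weight storage in GB (1 GB = 1e9 bytes), ignoring all overhead."""
    return n_params * bits_per_weight / 8 / 1e9


def reduction_pct(base_bits: float, quant_bits: float) -> float:
    """Ideal percentage reduction in weight storage when lowering the bit width."""
    return 100 * (1 - quant_bits / base_bits)


# 6.7B parameters is the smallest model size mentioned in the summary.
n_params = 6.7e9
for bits in (16, 8, 4, 2):
    print(f"{bits:>2}-bit: {weight_memory_gb(n_params, bits):.2f} GB, "
          f"{reduction_pct(16, bits):.1f}% smaller than 16-bit")
```

Note that this arithmetic says nothing about speed or energy: a smaller footprint can still run slower if the hardware has to dequantize weights or cannot use its fastest kernels.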

Method

An empirical study evaluated 13 post-training quantization configurations (weight-only and KV-cache, at 2 to 8 bits) across six LLMs on two APR benchmarks (HumanEval-Java and Defects4J), measuring plausibility, Jaccard Consistency Rate, inference time, energy consumption, and memory footprint.
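
The summary does not give the paper's exact definition of Jaccard Consistency Rate. A minimal sketch, assuming it is the Jaccard index between the set of bugs plausibly fixed by the base model and the set fixed by its quantized variant, shows why it is worth reporting alongside plausibility. The bug identifiers are hypothetical.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard index |a & b| / |a | b|; defined as 1.0 when both sets are empty."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


# Hypothetical bug IDs: both models fix four bugs, but only two in common.
base_fixed = {"bug-1", "bug-2", "bug-3", "bug-4"}
quant_fixed = {"bug-3", "bug-4", "bug-5", "bug-6"}

print(len(base_fixed), len(quant_fixed))          # same plausibility count
print(round(jaccard(base_fixed, quant_fixed), 3))  # overlap between the two sets
```

Here the two models have identical fix counts, yet the overlap is only one third, which is the kind of behavioral shift a plausibility score alone would hide.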

In practice

Evaluate Jaccard Consistency Rate alongside plausibility.
Consider AWQ4(M) for a balanced memory-effectiveness trade-off.
Avoid 2-bit or 3-bit quantization for APR.

Topics

LLM Quantization
Automated Program Repair
Inference Efficiency
Memory Optimization
Pareto Trade-offs
Jaccard Consistency

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.