Smaller Models, Unexpected Costs: Trade-offs in LLM Quantization for Automated Program Repair
Summary
This study empirically evaluates Large Language Model (LLM) quantization for Automated Program Repair (APR), analyzing 13 configurations across six LLMs (6.7B to 70B parameters) on the HumanEval-Java and Defects4J benchmarks. Quantization reduces memory footprint by up to 85% but, unexpectedly, increases inference time and energy consumption, which the study attributes to suboptimal hardware utilization. Quantized models match or exceed base-model repair effectiveness in 11 of 12 cases, yet they often fix different sets of bugs, which indicates a shift in behavior rather than a uniform loss of capability. A Pareto trade-off analysis found that 48% of configurations were suboptimal. The trade-offs among effectiveness, memory, and energy are highly sensitive to model architecture and task complexity, and low-bit quantization often performs poorly.
Key takeaway
Machine Learning Engineers deploying LLMs for Automated Program Repair should treat quantization as a trade-off rather than a free optimization. It cuts memory footprint substantially (up to 85%), but inference time and energy consumption can increase because of hardware inefficiencies. Evaluate specific quantization configurations, such as AWQ4(M), against your own model and task: effectiveness and consistency can shift, and nearly half of the configurations studied were suboptimal.
Key insights
LLM quantization for Automated Program Repair yields memory savings but introduces complex, model-dependent trade-offs in effectiveness and efficiency.
Principles
- Quantization can shift model behavior, not just reduce capability.
- Memory reduction does not guarantee faster inference or lower energy.
- Trade-offs are highly model- and task-dependent.
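To make the memory side of these principles concrete, here is a back-of-the-envelope sketch of weight storage at different bit widths. It assumes a 16-bit baseline and counts weights only; it ignores quantization scales and zero-points, the KV-cache, and activations, so it will not reproduce the study's measured figures. The function names are illustrative and do not come from the paper.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone, in GiB (assumption: no overheads)."""
    return num_params * bits_per_weight / 8 / 2**30


def reduction_vs_fp16(bits_per_weight: float) -> float:
    """Fraction of weight memory saved relative to an assumed 16-bit baseline."""
    return 1 - bits_per_weight / 16


if __name__ == "__main__":
    # 6.7B and 70B are the smallest and largest model sizes named in the summary.
    for params in (6.7e9, 70e9):
        for bits in (16, 8, 4, 2):
            gib = weight_memory_gib(params, bits)
            saved = reduction_vs_fp16(bits)
            print(f"{params / 1e9:.1f}B params @ {bits}-bit: {gib:.1f} GiB, {saved:.0%} saved")
```

Nothing in this arithmetic says anything about speed or energy: fewer bytes in memory can still mean more work per token if the hardware has to dequantize weights on the fly, which is consistent with the second principle above.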
Method
An empirical study evaluated 13 post-training quantization configurations (weight-only, KV-cache, 2-8 bits) across six LLMs on two APR benchmarks, measuring plausibility, Jaccard Consistency Rate, inference time, energy, and memory footprint.
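The consistency metric can be sketched as a plain Jaccard index over the sets of bugs each model fixes. This is an assumption about how the Jaccard Consistency Rate is computed; the paper's exact definition may differ (for example, it may be aggregated per benchmark or per patch). The bug identifiers below are made up for illustration and are not results from the study.

```python
def jaccard(base_fixed: set[str], quant_fixed: set[str]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| of two sets of fixed-bug IDs."""
    union = base_fixed | quant_fixed
    if not union:
        # Convention chosen here: two models that fix nothing are fully consistent.
        return 1.0
    return len(base_fixed & quant_fixed) / len(union)


# Hypothetical bug IDs, for illustration only.
base_fixed = {"bug-01", "bug-02", "bug-03"}
quant_fixed = {"bug-02", "bug-03", "bug-04"}

# Both models fix three bugs, so a raw fix count would call them equivalent,
# but they agree on only two of the four bugs fixed by either one.
print(len(base_fixed), len(quant_fixed), jaccard(base_fixed, quant_fixed))
```

This is why the summary distinguishes effectiveness from consistency: two configurations can report the same number of plausible fixes while overlapping only partially in which bugs they fix.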
In practice
- Evaluate Jaccard Consistency Rate alongside plausibility.
- Consider AWQ4(M) for a balanced memory-effectiveness trade-off.
- Avoid 2-bit or 3-bit quantization for APR.
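When comparing candidate configurations, a Pareto filter is one way to discard the clearly suboptimal ones. The sketch below assumes three objectives (plausible fixes to maximize, memory and energy to minimize); the study's own analysis may use different objectives or units. The configuration names and numbers are invented for illustration and are not measurements from the paper.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    fixes: int         # plausible fixes, higher is better
    memory_gb: float   # memory footprint, lower is better
    energy_kj: float   # energy per run, lower is better


def dominates(a: Config, b: Config) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    no_worse = (a.fixes >= b.fixes
                and a.memory_gb <= b.memory_gb
                and a.energy_kj <= b.energy_kj)
    strictly_better = (a.fixes > b.fixes
                       or a.memory_gb < b.memory_gb
                       or a.energy_kj < b.energy_kj)
    return no_worse and strictly_better


def pareto_front(configs: list[Config]) -> list[Config]:
    """Keep only configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Hypothetical data, for illustration only.
configs = [
    Config("cfg_a", fixes=50, memory_gb=14.0, energy_kj=100.0),
    Config("cfg_b", fixes=50, memory_gb=8.0, energy_kj=120.0),
    Config("cfg_c", fixes=45, memory_gb=8.0, energy_kj=130.0),  # dominated by cfg_b
    Config("cfg_d", fixes=30, memory_gb=3.0, energy_kj=150.0),
]

print([c.name for c in pareto_front(configs)])
```

In this made-up data, cfg_c is dropped because cfg_b fixes more bugs with the same memory and less energy, while cfg_d survives despite its low fix count because nothing else matches its memory footprint. Whether such a configuration is acceptable is a separate decision from whether it is Pareto-optimal.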
Topics
- LLM Quantization
- Automated Program Repair
- Inference Efficiency
- Memory Optimization
- Pareto Trade-offs
- Jaccard Consistency
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.