Repository-Level Solidity Code Generation with Large Language Models: From Prompting to Fine-Tuning
Summary
The paper introduces SolidityBench, a benchmark of 5,470 repository-level Solidity smart contracts paired with natural language descriptions, to address the challenges of domain-specific code generation for LLMs. It also proposes SolidityScore, a semantics-aware evaluation metric that prioritizes critical Solidity constructs over surface-level token matching. An empirical evaluation of models such as Qwen2.5-Coder, DeepSeek-Coder, and CodeLlama across zero-shot prompting, chain-of-thought (CoT), in-context learning (ICL), retrieval-augmented generation (RAG), and supervised fine-tuning (SFT) found that general-purpose models exhibit structural deficiencies in Solidity. RAG achieved the strongest performance among the non-parametric methods, while ICL suffered from context saturation beyond two examples. SFT emerged as the most effective adaptation strategy, significantly improving semantic correctness by internalizing Solidity-specific constraints. The study concludes that combining high-quality domain data with SFT is the best path to reliable LLM-generated smart contracts.
Key takeaway
Machine learning engineers building LLM-based Solidity code generation tools should prioritize supervised fine-tuning (SFT) on high-quality, domain-specific datasets such as SolidityBench. Retrieval-augmented generation (RAG) offers some benefits, but SFT internalizes critical Solidity constraints, yielding significantly more reliable and semantically accurate smart contracts. Even SFT-generated code may still fail to compile because of complex project-level dependencies.
Key insights
High-quality domain data and supervised fine-tuning are crucial for reliable LLM-generated Solidity smart contracts.
Principles
- Domain-specific LLM performance requires specialized data and evaluation.
- Context saturation limits in-context learning effectiveness.
- Internalizing domain knowledge via SFT surpasses external prompting.
Method
A three-phase framework: construct SolidityBench (5,470 NL-code pairs), investigate adaptation paradigms (CoT, ICL, RAG, SFT), and benchmark performance using SolidityScore and BLEU.
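To make the contrast between semantics-aware scoring and surface-level token matching concrete, here is a toy sketch in Python. It is not the SolidityScore formula, which this summary does not reproduce. The construct list, the weights, and the function names are all illustrative assumptions. The only idea it demonstrates is that omitting a critical construct (for example an access-control modifier) can be penalized more heavily than omitting an ordinary token.

```python
import re
from collections import Counter

# ASSUMPTION: these constructs and weights are hypothetical, chosen only for
# illustration. They are not the weights used by SolidityScore.
CRITICAL_WEIGHTS = {
    "payable": 3.0,
    "require": 3.0,
    "modifier": 3.0,
    "onlyOwner": 3.0,
    "external": 2.0,
    "view": 2.0,
}
DEFAULT_WEIGHT = 1.0


def tokenize(code: str) -> list[str]:
    """Split code into identifiers/keywords, integer literals, and single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def weighted_overlap(reference: str, candidate: str) -> float:
    """Weighted recall of reference tokens that also appear in the candidate.

    Returns a value in [0, 1]. Tokens listed in CRITICAL_WEIGHTS count more
    than ordinary tokens, so dropping them lowers the score more.
    """
    ref = Counter(tokenize(reference))
    cand = Counter(tokenize(candidate))
    total = sum(CRITICAL_WEIGHTS.get(tok, DEFAULT_WEIGHT) * n for tok, n in ref.items())
    if total == 0:
        return 0.0
    matched = sum(
        CRITICAL_WEIGHTS.get(tok, DEFAULT_WEIGHT) * min(n, cand[tok])
        for tok, n in ref.items()
    )
    return matched / total


reference = "function withdraw() external onlyOwner { require(ok); }"
missing_modifier = "function withdraw() external { require(ok); }"
missing_semicolon = "function withdraw() external onlyOwner { require(ok) }"

# Dropping the access-control modifier costs more than dropping a semicolon.
print(weighted_overlap(reference, missing_modifier))
print(weighted_overlap(reference, missing_semicolon))
```

An unweighted token-overlap measure would treat both omissions as the same one-token difference, which is the kind of surface-level behavior the summary says SolidityScore is designed to avoid.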
In practice
- For non-parametric generation, use RAG over ICL.
- Limit ICL examples to two to avoid context saturation.
- Prioritize SFT with domain data for robust Solidity LLMs.
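The first two recommendations above can be sketched as a small prompt builder that retrieves the most similar description-code pairs and never includes more than two of them. This is a minimal Python sketch under stated assumptions: the Jaccard word-overlap retriever, the prompt layout, and every name below are hypothetical stand-ins, not the retrieval setup used in the paper.

```python
import re

# Cap taken from the summary's finding that ICL saturates beyond two examples.
MAX_EXAMPLES = 2


def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric word set used for similarity."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two strings."""
    ta, tb = _tokens(a), _tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve(query: str, corpus: list[tuple[str, str]], k: int = MAX_EXAMPLES) -> list[tuple[str, str]]:
    """Return up to k (description, code) pairs most similar to the query.

    ASSUMPTION: a real system would likely use an embedding-based retriever;
    word overlap keeps this sketch dependency-free.
    """
    k = max(0, min(k, MAX_EXAMPLES))  # never exceed the two-example cap
    ranked = sorted(corpus, key=lambda pair: jaccard(query, pair[0]), reverse=True)
    return ranked[:k]


def build_prompt(query: str, corpus: list[tuple[str, str]], k: int = MAX_EXAMPLES) -> str:
    """Assemble retrieved examples followed by the target description."""
    parts = [
        f"### Description\n{description}\n### Solidity\n{code}\n"
        for description, code in retrieve(query, corpus, k)
    ]
    parts.append(f"### Description\n{query}\n### Solidity\n")
    return "\n".join(parts)


# Hypothetical mini-corpus standing in for a benchmark of NL-code pairs.
corpus = [
    ("ERC20 token with mint function restricted to owner", "contract Token { }"),
    ("Simple escrow that releases funds to seller", "contract Escrow { }"),
    ("Voting contract with one vote per address", "contract Ballot { }"),
]

print(build_prompt("Token with owner-only mint", corpus, k=5))
```

Even though the call asks for five examples, the builder clamps the request to two, so the prompt stays within the range where the summary reports in-context examples still help.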
Topics
- Solidity Code Generation
- Smart Contracts
- Large Language Models
- Supervised Fine-Tuning
- Retrieval-Augmented Generation
- SolidityScore Metric
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.