Repository-Level Solidity Code Generation with Large Language Models: From Prompting to Fine-Tuning

2026-06-19 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Blockchain & Distributed Ledger Technology · Depth: Expert, extended

Summary

SolidityBench is a new benchmark of 5,470 repository-level Solidity smart contracts paired with natural language descriptions, built to address the challenges of domain-specific code generation with LLMs. It is accompanied by SolidityScore, a semantics-aware evaluation metric that prioritizes critical Solidity constructs over surface-level token matching. An empirical evaluation of models such as Qwen2.5-Coder, DeepSeek-Coder, and CodeLlama across zero-shot prompting, chain-of-thought (CoT) prompting, in-context learning (ICL), retrieval-augmented generation (RAG), and supervised fine-tuning (SFT) found that general-purpose models exhibit structural deficiencies in Solidity. RAG performed best among the non-parametric methods, while ICL hit context saturation beyond two examples. SFT was the most effective adaptation strategy overall, significantly improving semantic correctness by internalizing Solidity-specific constraints. The study concludes that combining high-quality domain data with SFT is the best route to reliable LLM-generated smart contracts.

Key takeaway

Machine learning engineers building LLM-based Solidity code generation tools should prioritize supervised fine-tuning (SFT) on high-quality, domain-specific datasets such as SolidityBench. Retrieval-augmented generation (RAG) offers some benefit, but SFT internalizes critical Solidity constraints and yields significantly more reliable and semantically accurate smart contracts. Even SFT-generated code may still run into compilation problems caused by complex project-level dependencies.

Key insights

High-quality domain data and supervised fine-tuning are crucial for reliable LLM-generated Solidity smart contracts.

Principles

Domain-specific LLM performance requires specialized data and evaluation.
Context saturation limits in-context learning effectiveness.
Internalizing domain knowledge via SFT surpasses external prompting.

Method

The study follows a three-phase framework: (1) construct SolidityBench, a set of 5,470 pairs of natural language descriptions and code; (2) investigate adaptation paradigms (CoT, ICL, RAG, and SFT); and (3) benchmark performance with SolidityScore and BLEU.
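The idea of weighting critical constructs above plain token matching can be illustrated with a small sketch. This is not the SolidityScore definition, which this summary does not give; the construct list, the weights, and the function names below are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical weights for illustration only; the actual constructs and
# weights used by SolidityScore are not given in this summary.
CONSTRUCT_WEIGHTS = {
    "require": 3.0,
    "modifier": 3.0,
    "payable": 3.0,
    "emit": 2.0,
    "event": 2.0,
    "function": 1.0,
}
DEFAULT_WEIGHT = 0.5  # assumed weight for every other token


def tokenize(source):
    """Extract identifier-like tokens (keywords and names) from source text."""
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)


def weighted_mass(counts):
    """Sum token counts, scaling each token by its assumed weight."""
    return sum(CONSTRUCT_WEIGHTS.get(tok, DEFAULT_WEIGHT) * n
               for tok, n in counts.items())


def construct_weighted_f1(reference, candidate):
    """Weighted F1 over the multiset overlap of reference and candidate tokens."""
    ref = Counter(tokenize(reference))
    cand = Counter(tokenize(candidate))
    matched = weighted_mass(ref & cand)  # & keeps the minimum count per token
    if matched == 0:
        return 0.0
    precision = matched / weighted_mass(cand)
    recall = matched / weighted_mass(ref)
    return 2 * precision * recall / (precision + recall)


REFERENCE = ("function withdraw(uint amount) public "
             "{ require(balance >= amount); balance -= amount; }")
# Each candidate differs from the reference by exactly one token.
GUARD_REPLACED = ("function withdraw(uint amount) public "
                  "{ check(balance >= amount); balance -= amount; }")
NAME_CHANGED = ("function take(uint amount) public "
                "{ require(balance >= amount); balance -= amount; }")

if __name__ == "__main__":
    print(construct_weighted_f1(REFERENCE, GUARD_REPLACED))
    print(construct_weighted_f1(REFERENCE, NAME_CHANGED))
```

Both candidates differ from the reference by a single token, so an unweighted token overlap would rate them the same. Under the assumed weights, losing the `require` guard costs more than renaming the function, which is the kind of distinction a semantics-aware metric is meant to capture.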

In practice

For non-parametric generation, use RAG over ICL.
Limit ICL examples to two to avoid context saturation.
Prioritize SFT with domain data for robust Solidity LLMs.
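The first two recommendations can be sketched as a retrieval step that feeds a prompt builder with at most two examples. This is a minimal sketch under stated assumptions: the lexical Jaccard retriever, the prompt layout, and the names `retrieve` and `build_prompt` are hypothetical and are not taken from the paper, whose retrieval setup this summary does not describe.

```python
import re

# The summary reports ICL saturating beyond two examples, so the cap is 2.
MAX_EXAMPLES = 2


def tokens(text):
    """Lowercased word set used for a simple lexical similarity."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between the word sets of two descriptions."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve(query, corpus, k=MAX_EXAMPLES):
    """Return the top-k (description, code) pairs, never more than the cap."""
    k = min(k, MAX_EXAMPLES)
    ranked = sorted(corpus, key=lambda pair: jaccard(query, pair[0]),
                    reverse=True)
    return ranked[:k]


def build_prompt(query, corpus):
    """Assemble a prompt from retrieved examples plus the target description."""
    parts = ["Write a Solidity contract for the final description."]
    for description, code in retrieve(query, corpus):
        parts.append(f"### Description\n{description}\n### Contract\n{code}")
    parts.append(f"### Description\n{query}\n### Contract")
    return "\n\n".join(parts)


if __name__ == "__main__":
    # Toy corpus standing in for a real NL-code dataset.
    demo_corpus = [
        ("ERC20 token with mint and burn", "contract Token { }"),
        ("Simple voting ballot with proposals", "contract Ballot { }"),
        ("ERC20 token with a capped supply", "contract CappedToken { }"),
    ]
    print(build_prompt("ERC20 token with pausable transfers", demo_corpus))
```

Capping inside `retrieve` rather than in the caller means a larger `k` cannot silently push the prompt past two examples. A production retriever would more likely use embeddings than word overlap; the lexical score is used here only to keep the sketch dependency-free.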

Topics

Solidity Code Generation
Smart Contracts
Large Language Models
Supervised Fine-Tuning
Retrieval-Augmented Generation
SolidityScore Metric

Code references

ChenS0827/SCG

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.