An Empirical Study of LLM-Generated Specifications for VeriFast
Summary
An empirical study evaluated how well large language models (LLMs) generate specifications for verifying 303 C functions with the separation logic (SL) verifier VeriFast. Across two stages, the researchers explored eight prompting approaches, ten LLMs, and three input types (Natural Language, Functional Behavior, Functional Behavior Plus). The findings indicate that LLMs preserve functional behavior in the generated source code and specifications over 91% of the time. However, the overall verification success rate was a modest 31.4%. Gemini 2.5 Pro, when combined with formal contract inputs, yielded higher success rates. A key finding is that 94% of errors stemmed from the LLMs' lack of domain-specific knowledge about SL verifiers, such as VeriFast's syntax and heap reasoning, rather than from failures of general logical reasoning. The study provides a baseline and guidance for developing LLM-based verification environments.
Key takeaway
AI engineers developing automated verification tools should recognize that while LLMs can preserve functional behavior in generated specifications, their direct verification success with separation logic verifiers like VeriFast is currently modest (31.4%). Prioritize providing formal specifications as input, and consider models like Gemini 2.5 Pro, which performed better than the other models studied, especially on complex functions. Plan for feedback loops that use verifier error messages, or integrate symbolic techniques, to address the prevalent domain-specific errors in heap reasoning.
Key insights
LLMs preserve functional behavior in VeriFast specifications but achieve modest verification success due to domain-specific knowledge gaps.
Principles
- Formal specifications significantly improve LLM verification success.
- LLM choice critically affects performance on complex verification tasks.
- Most LLM errors are domain-specific, not general logical reasoning failures.
Method
The study ran in two stages: after comparing prompting approaches and LLMs, it selected the RAG-sparse with splitting prompt and the top three LLMs (Gemini 2.5 Pro, Claude-3-7, GPT-4o), then used them to generate VeriFast specifications for 303 C functions across the three input types.
In practice
- Supply formal specifications to LLMs for higher verification rates.
- Use Gemini 2.5 Pro for better performance on functions involving concurrency or loops.
- Integrate verifier error messages into LLM repair loops.
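The first and last recommendations above can be combined into a single repair loop. The following is a minimal sketch, not the study's implementation: `repair_loop`, `VerifyResult`, `fake_generate`, and `fake_verify` are hypothetical names, the prompt wording is invented for illustration, and the error string is a placeholder. In a real setup, `generate` would call an LLM and `verify` would run the verifier on the candidate file and capture its output.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class VerifyResult:
    ok: bool       # True if the verifier accepted the annotated code
    message: str   # verifier output, e.g. the reported error (empty on success)


def repair_loop(
    source: str,
    contract: str,
    generate: Callable[[str], str],
    verify: Callable[[str], VerifyResult],
    max_rounds: int = 3,
) -> Optional[str]:
    """Return annotated code that verifies, or None if every round fails."""
    # Supply the formal contract up front, alongside the C source.
    prompt = (
        "Add VeriFast annotations to this C function.\n"
        f"Formal contract:\n{contract}\n"
        f"Source:\n{source}\n"
    )
    for _ in range(max_rounds):
        candidate = generate(prompt)
        result = verify(candidate)
        if result.ok:
            return candidate
        # Feed the verifier's error back so the next attempt can target it.
        prompt += (
            f"\nPrevious attempt:\n{candidate}\n"
            f"Verifier error:\n{result.message}\n"
            "Fix the annotations and try again.\n"
        )
    return None


# Hypothetical stand-ins so the sketch runs without an LLM or a verifier.
def fake_generate(prompt: str) -> str:
    # Pretend the model only succeeds after it has seen an error message.
    return "FIXED" if "Verifier error" in prompt else "BROKEN"


def fake_verify(code: str) -> VerifyResult:
    if code == "FIXED":
        return VerifyResult(True, "")
    return VerifyResult(False, "placeholder: heap chunk not found")


if __name__ == "__main__":
    out = repair_loop(
        "void f(void) {}", "requires true; ensures true;",
        fake_generate, fake_verify,
    )
    print(out)
```

Bounding the loop with `max_rounds` keeps cost predictable. Appending each failed attempt and its error to the prompt is one simple design choice; a production loop might instead summarize or truncate the history.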
Topics
- Large Language Models
- Static Program Verification
- Separation Logic
- VeriFast
- Prompt Engineering
- Formal Methods
Best for: Research Scientist, AI Scientist, Software Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.