An Empirical Study of LLM-Generated Specifications for VeriFast

2026-06-26 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

An empirical study evaluated how well large language models (LLMs) generate specifications for verifying 303 C functions with VeriFast, a separation logic (SL) verifier. Across two stages, the researchers explored eight prompting approaches, ten LLMs, and three input types (Natural Language, Functional Behavior, Functional Behavior Plus). The LLMs preserved functional behavior in the generated source code and specifications over 91% of the time, but the overall verification success rate was a modest 31.4%. Gemini 2.5 Pro combined with formal contract inputs yielded higher success rates. Notably, 94% of errors stemmed from the LLMs' lack of domain-specific knowledge about SL verifiers, such as VeriFast's syntax and heap reasoning, rather than from failures of general logical reasoning. The study provides a baseline and guidance for developing LLM-based verification environments.

Key takeaway

AI engineers building automated verification tools should note that LLMs largely preserve functional behavior in generated specifications, yet their direct verification success with separation logic verifiers such as VeriFast remains modest (31.4%). Prioritize formal specifications as input, and consider models like Gemini 2.5 Pro, which demonstrated superior performance, especially on complex functions. Expect to add feedback loops that use verifier error messages, or to integrate symbolic techniques, to address the prevalent domain-specific errors in heap reasoning.

Key insights

LLMs preserve functional behavior in VeriFast specifications but achieve modest verification success due to domain-specific knowledge gaps.

Principles

Formal specifications significantly improve LLM verification success.
LLM choice critically affects performance on complex verification tasks.
Most LLM errors are domain-specific, not general logical reasoning failures.

Method

The empirical study ran in two stages. It selected the RAG-sparse with splitting prompt and the top three LLMs (Gemini 2.5 Pro, Claude-3-7, GPT-4o), then used them to generate VeriFast specifications for 303 C functions across three input types.

In practice

Supply formal specifications to LLMs for higher verification rates.
Use Gemini 2.5 Pro for better performance on functions involving concurrency or loops.
Integrate verifier error messages into LLM repair loops.
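
The repair loop recommended above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the study's implementation: `repair_loop`, `VerifyResult`, and the `generate` and `verify` callables are hypothetical names introduced here. In a real setup, `generate` would wrap an LLM call and `verify` would wrap an invocation of VeriFast on the annotated C file; both are left pluggable because neither the model API nor the verifier's output format is specified in this summary.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class VerifyResult:
    """Outcome of one verifier run (hypothetical shape, not VeriFast's own)."""
    ok: bool
    message: str  # raw verifier output; its format is an assumption


def repair_loop(
    source: str,
    generate: Callable[[str, Optional[str]], str],
    verify: Callable[[str], VerifyResult],
    max_rounds: int = 3,
) -> Tuple[Optional[str], List[str]]:
    """Ask for a specification, verify it, and feed errors back to the generator.

    Returns the first candidate that verifies (or None if every round fails),
    together with the error messages collected along the way.
    """
    errors: List[str] = []
    feedback: Optional[str] = None  # no verifier feedback on the first round
    for _ in range(max_rounds):
        candidate = generate(source, feedback)
        result = verify(candidate)
        if result.ok:
            return candidate, errors
        errors.append(result.message)
        feedback = result.message  # the next prompt sees the latest error
    return None, errors


if __name__ == "__main__":
    # Stand-ins for an LLM and a verifier, used only to exercise the loop.
    def stub_generate(src: str, feedback: Optional[str]) -> str:
        marker = "/* revised spec */" if feedback else "/* first attempt */"
        return marker + "\n" + src

    def stub_verify(candidate: str) -> VerifyResult:
        ok = candidate.startswith("/* revised spec */")
        return VerifyResult(ok, "" if ok else "hypothetical verifier error")

    spec, seen = repair_loop("int inc(int x) { return x + 1; }",
                             stub_generate, stub_verify)
    print(spec is not None, len(seen))
```

Capping the number of rounds keeps cost bounded, and returning the collected error messages makes it possible to inspect which kinds of failure recur, for example the domain-specific heap reasoning errors the study highlights.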

Topics

Large Language Models
Static Program Verification
Separation Logic
VeriFast
Prompt Engineering
Formal Methods

Code references

verifast/verifast

Best for: Research Scientist, AI Scientist, Software Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.