Vibe Coding Ate My Homework: An evaluation of AI approaches to greenfield software engineering and programming

2026-06-18 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, extended

Summary

This paper evaluates "vibe coding", the practice of generating code from natural language prompts without human review, for greenfield software engineering tasks. The authors developed a Python-focused evaluation suite of simple, isolated greenfield tasks and used it to test four locally run Ollama models (Phi4, Gemma, Mistral, Qwen_coder) across three levels of technical prompt depth. The suite incorporates a retry system and uses LLMs for scoring; manual audits showed that the LLM scorer was conservative. Initial results showed Phi4 performing best, but after manual adjustment the models were roughly on par. The results challenge the hypothesis that more technical prompts improve reliability: performance declined slightly as prompts became more detailed. The paper concludes that an entirely hands-off approach to vibe coding is not viable long-term for larger, higher-stakes software development, because errors compound.

Key takeaway

Software engineers considering "vibe coding" for greenfield projects should recognize its current limitations. The study shows that even simple tasks reach only 80-90% accuracy, and without human review those errors compound into significant failure rates in larger systems. Maintain human oversight for critical applications, and prefer clear, concise prompts over overly detailed ones to minimize hallucinations and architectural flaws.
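
To make the compounding concrete, here is a minimal sketch in Python. It assumes every task succeeds independently with the same probability, a simplification introduced here for illustration and not taken from the paper; the function name is likewise hypothetical.

```python
def all_tasks_succeed_probability(per_task_accuracy: float, task_count: int) -> float:
    """Probability that every one of task_count tasks is correct.

    Assumption (ours, not the paper's): tasks succeed independently,
    each with the same per_task_accuracy.
    """
    if not 0.0 <= per_task_accuracy <= 1.0:
        raise ValueError("per_task_accuracy must be between 0 and 1")
    if task_count < 0:
        raise ValueError("task_count must be non-negative")
    return per_task_accuracy ** task_count


if __name__ == "__main__":
    # Sweep the 80-90% range quoted in the takeaway over growing system sizes.
    for accuracy in (0.80, 0.90):
        for tasks in (1, 5, 10, 20):
            p = all_tasks_succeed_probability(accuracy, tasks)
            print(f"accuracy={accuracy:.2f} tasks={tasks:2d} -> all correct: {p:.3f}")
```

Under these assumptions, a system built from ten such tasks is entirely correct only about 35% of the time at 90% per-task accuracy, and about 11% of the time at 80%.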

Key insights

Vibe coding's hands-off approach is unreliable for complex greenfield software, despite initial efficiency gains.

Principles

LLMs prioritize the appearance of an ideal answer over admitting limitations or uncertainty.
Concise, informative prompts that avoid ambiguous words such as "given" maximize consistency.
Probabilistic LLMs introduce hallucination risk when scoring ambiguous outputs semantically.

Method

An R-orchestrated evaluation suite sends Python greenfield tasks to LLMs, parses JSON-wrapped code, executes it in a tracking namespace, and scores output using text/vision models or deterministic checks, with a single retry.
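
The flow described above (parse JSON-wrapped code, execute it in a namespace, apply a deterministic check, retry once) can be sketched roughly as follows. This is an illustrative Python reconstruction, not the authors' R-orchestrated suite: the {"code": "..."} reply shape, the generate callback, and all function names are assumptions, and the LLM-based text/vision scoring is left out in favor of a deterministic check.

```python
import json
from typing import Callable, Dict, Optional, Tuple

# Hypothetical callback types: generate(prompt, feedback) returns the raw model
# reply; check(namespace) is a deterministic pass/fail test on what the code defined.
Generate = Callable[[str, Optional[str]], str]
Check = Callable[[Dict[str, object]], bool]


def extract_code(raw_reply: str) -> str:
    """Parse a reply of the assumed form {"code": "..."} and return the code."""
    code = json.loads(raw_reply)["code"]
    if not isinstance(code, str):
        raise TypeError("'code' field must be a string")
    return code


def run_in_namespace(code: str) -> Dict[str, object]:
    """Execute code in a fresh namespace and return the names it defined."""
    namespace: Dict[str, object] = {}
    exec(code, namespace)  # No sandboxing here; run untrusted code in isolation.
    return {k: v for k, v in namespace.items() if k != "__builtins__"}


def evaluate_task(prompt: str, generate: Generate, check: Check,
                  max_retries: int = 1) -> Tuple[bool, int]:
    """Return (passed, attempts_used), feeding failure details into each retry."""
    feedback: Optional[str] = None
    for attempt in range(1, max_retries + 2):
        raw_reply = generate(prompt, feedback)
        try:
            if check(run_in_namespace(extract_code(raw_reply))):
                return True, attempt
            feedback = "The code ran but failed the check."
        except Exception as exc:  # Bad JSON, missing key, or a runtime error.
            feedback = f"{type(exc).__name__}: {exc}"
    return False, max_retries + 1


if __name__ == "__main__":
    # Stub generator standing in for a model: wrong first, corrected on retry.
    replies = iter([
        '{"code": "def add(a, b): return a - b"}',
        '{"code": "def add(a, b): return a + b"}',
    ])

    def fake_generate(prompt: str, feedback: Optional[str]) -> str:
        return next(replies)

    def check_add(ns: Dict[str, object]) -> bool:
        return callable(ns.get("add")) and ns["add"](2, 3) == 5

    print(evaluate_task("Write add(a, b).", fake_generate, check_add))
```

With the stub generator shown, the first reply fails the check and the single retry succeeds, so the script prints (True, 2). A real harness would replace fake_generate with a call to a locally served model and would need to isolate the exec step.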

In practice

Avoid ambiguous words in prompts to minimize LLM misinterpretation.
Balance prompt brevity with context for simple tasks to maximize consistency.
Implement retry systems for LLM code generation to assess course-correction capabilities.
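
As one possible way to act on the first point, a prompt can be screened for flagged words before it is sent. The sketch below is hypothetical: the function name is invented for illustration, and the default word list contains only "given", the single example named above; any further words are for the user to supply.

```python
import re
from typing import Iterable, List


def find_ambiguous_words(prompt: str,
                         ambiguous_words: Iterable[str] = ("given",)) -> List[str]:
    """Return the flagged words that appear in prompt as whole words.

    Matching is case-insensitive and respects word boundaries, so
    "Given" is flagged but "forgiven" is not.
    """
    found = []
    for word in ambiguous_words:
        if re.search(rf"\b{re.escape(word)}\b", prompt, flags=re.IGNORECASE):
            found.append(word)
    return found


if __name__ == "__main__":
    print(find_ambiguous_words("Sort the given list in ascending order."))
    print(find_ambiguous_words("Sort the list passed as the first argument."))
```

A check like this only flags wording; deciding how to rephrase the prompt remains a human judgment.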

Topics

Vibe Coding
Greenfield Software Engineering
LLM Code Generation
Prompt Engineering
Software Evaluation
Python Programming

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.