Vibe Coding Ate My Homework: An evaluation of AI approaches to greenfield software engineering and programming
Summary
This paper evaluates "vibe coding" for greenfield software engineering tasks. This practice involves generating code from natural language prompts without human review. A Python-focused evaluation suite was developed for simple, isolated greenfield tasks. It tested four locally run Ollama models (Phi4, Gemma, Mistral, Qwen_coder) across three levels of technical prompt depth. The suite incorporates a retry system and uses LLMs for scoring, with manual audits revealing a conservative scorer. Initial results showed Phi4 performing best, but manual adjustment indicated models were roughly on par. The study challenged the hypothesis that more technical prompts improve reliability, suggesting a slight decline in performance for more detailed prompts. It concludes that an entirely hands-off approach to vibe coding is not viable long-term for larger, higher-stakes software development due to compounding errors.
Key takeaway
For software engineers considering "vibe coding" for greenfield projects, recognize its current limitations. The study shows that even simple, isolated tasks reach only 80-90% accuracy, and that per-task error rate compounds into significant failure rates for larger systems built without human review. Maintain human oversight for critical applications, and prefer clear, concise prompts over overly detailed ones to minimize hallucinations and architectural flaws.
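The compounding effect can be illustrated with a short calculation. This is a minimal sketch, not the paper's analysis: it assumes each task succeeds independently with the same probability, which real systems will not satisfy exactly, and the function name `all_succeed_probability` is hypothetical.

```python
def all_succeed_probability(per_task_accuracy: float, n_tasks: int) -> float:
    """Probability that every one of n_tasks succeeds.

    Assumption (not from the source): tasks succeed independently,
    each with the same per_task_accuracy.
    """
    return per_task_accuracy ** n_tasks


# Compare the low and high ends of the reported 80-90% range.
for n in (1, 5, 10, 20):
    low = all_succeed_probability(0.80, n)
    high = all_succeed_probability(0.90, n)
    print(f"{n:>2} tasks: {low:.3f} to {high:.3f}")
```

Under these assumptions, a system made of ten unreviewed tasks would be fully correct only about 11% to 35% of the time, which is the sense in which small per-task error rates become large system-level ones.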
Key insights
Vibe coding's hands-off approach is unreliable for complex greenfield software, despite initial efficiency gains.
Principles
- LLMs prioritize the appearance of an "ideal answer" over admitting limitations or uncertainty.
- Concise, informative prompts that avoid ambiguous words such as "given" maximize consistency.
- Using probabilistic LLMs to score ambiguous outputs semantically introduces hallucination risk.
Method
An R-orchestrated evaluation suite sends Python greenfield tasks to LLMs, parses JSON-wrapped code, executes it in a tracking namespace, and scores output using text/vision models or deterministic checks, with a single retry.
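The loop described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the authors' suite: it is written entirely in Python instead of R, it assumes the model reply is a JSON object with a `code` key, it uses only a deterministic check in place of the text/vision scorers, and the names `extract_code`, `run_in_namespace`, `evaluate`, and the stub generator are all hypothetical.

```python
import json
from typing import Callable


def extract_code(reply: str) -> str:
    """Pull the code out of a reply of the assumed form {"code": "..."}."""
    return json.loads(reply)["code"]


def run_in_namespace(code: str) -> dict:
    """Execute generated code in a fresh dict so its definitions can be inspected."""
    namespace: dict = {}
    exec(code, namespace)  # unsafe for untrusted code outside a sandbox
    return namespace


def evaluate(
    task_prompt: str,
    generate: Callable[[str], str],
    check: Callable[[dict], bool],
    max_retries: int = 1,
) -> dict:
    """Run one task, allowing max_retries extra attempts with failure feedback."""
    prompt = task_prompt
    for attempt in range(max_retries + 1):
        try:
            namespace = run_in_namespace(extract_code(generate(prompt)))
            if check(namespace):
                return {"passed": True, "attempts": attempt + 1}
            feedback = "the output failed the check"
        except Exception as exc:  # malformed JSON, missing key, or runtime error
            feedback = f"{type(exc).__name__}: {exc}"
        # Feed the failure back so the model has a chance to course-correct.
        prompt = f"{task_prompt}\n\nThe previous attempt failed: {feedback}. Please fix it."
    return {"passed": False, "attempts": max_retries + 1}


def make_stub_generator() -> Callable[[str], str]:
    """Stand-in for a model call: a buggy reply first, then a corrected one."""
    replies = iter([
        json.dumps({"code": "def add(a, b):\n    return a - b\n"}),
        json.dumps({"code": "def add(a, b):\n    return a + b\n"}),
    ])
    return lambda prompt: next(replies)


result = evaluate(
    "Write a function add(a, b) that returns the sum.",
    make_stub_generator(),
    lambda ns: ns["add"](2, 3) == 5,
)
print(result)
```

In a real run, the stub would be replaced by a call to a locally hosted model, and the generated code would need to execute in an isolated environment, since `exec` runs whatever the model returns.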
In practice
- Avoid ambiguous words in prompts to minimize LLM misinterpretation.
- Balance prompt brevity with context for simple tasks to maximize consistency.
- Implement retry systems for LLM code generation to assess course-correction capabilities.
Topics
- Vibe Coding
- Greenfield Software Engineering
- LLM Code Generation
- Prompt Engineering
- Software Evaluation
- Python Programming
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.