I Asked ChatGPT, Claude and DeepSeek to Build Tetris
Summary
An editorial analyst ran a practical test comparing the code generation capabilities of three flagship large language models: Claude Opus 4.5, GPT-5.2 Pro, and DeepSeek V3.2. The goal was to see whether each could build a fully functional Tetris game from a single, detailed prompt, judged on first-attempt success, feature completeness, playability, and cost-effectiveness. Claude Opus 4.5 delivered a complete, smooth, and playable game in under 2 minutes. GPT-5.2 Pro, described as OpenAI's most intelligent model and 4x more expensive than Opus 4.5, produced a game with a layout bug on its first attempt; a follow-up prompt fixed the bug, but the result was still less smooth to play. DeepSeek V3.2, the most affordable option, had multiple bugs, including disappearing pieces and scrolling issues, and the game remained unplayable even after a second iteration. In the cost analysis, Opus 4.5 came to ~$0.09 for a playable game, GPT-5.2 Pro to ~$0.41 for a playable game with poor UX, and DeepSeek V3.2 to ~$0.005 for an unplayable game.
Key takeaway
For AI engineers and software developers evaluating LLMs for code generation, prioritize Claude Opus 4.5 for day-to-day coding tasks: its high first-attempt success and better output quality save both time and cost. If your project has a tight budget and you have debugging capacity, DeepSeek V3.2 is cheap enough per attempt that several iterations remain inexpensive, though in this test its game was still unplayable after two attempts. Avoid GPT-5.2 Pro for simple coding: its strengths lie in complex reasoning, so it is costlier and less efficient than straightforward development requires.
Key insights
Model performance for coding tasks varies significantly across LLMs, impacting development cost and user experience.
Principles
- Higher cost does not guarantee superior code generation.
- First-attempt success reduces overall development cost.
- Model suitability depends on task complexity and budget.
Method
Evaluate LLMs for code generation by prompting them to build a complex, interactive application (e.g., Tetris) and assessing first-attempt success, feature completeness, playability, and total cost across iterations.
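The method above can be recorded in a small script so that cost and playability are compared on the same footing. What follows is a minimal sketch, not the analyst's actual tooling: the names `TrialResult` and `rank_by_value` are hypothetical, the dollar figures are the approximate values quoted in the summary and are assumed here to be totals across all attempts, and the attempt counts are read off the summary's description (one attempt for Opus 4.5, a first attempt plus one follow-up for the other two).

```python
from dataclasses import dataclass


@dataclass
class TrialResult:
    """One model's outcome in the Tetris test (hypothetical record type)."""
    model: str
    attempts: int          # prompts sent, including follow-up fixes
    total_cost_usd: float  # approximate spend, assumed to cover all attempts
    playable: bool         # whether the final build could actually be played


def rank_by_value(results):
    """Keep only playable builds, ordered by total cost, then by attempts."""
    playable = [r for r in results if r.playable]
    return sorted(playable, key=lambda r: (r.total_cost_usd, r.attempts))


# Approximate figures taken from the summary; attempt counts are an assumption.
results = [
    TrialResult("Claude Opus 4.5", attempts=1, total_cost_usd=0.09, playable=True),
    TrialResult("GPT-5.2 Pro", attempts=2, total_cost_usd=0.41, playable=True),
    TrialResult("DeepSeek V3.2", attempts=2, total_cost_usd=0.005, playable=False),
]

for result in rank_by_value(results):
    print(f"{result.model}: ${result.total_cost_usd:.3f} over {result.attempts} attempt(s)")
```

Filtering on playability before sorting by cost reflects the summary's point that the cheapest run is not the best value if the output never works. Playability is a single boolean here, so the sketch does not capture the difference in smoothness between the two playable games.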
In practice
- Use Claude Opus 4.5 for reliable daily coding tasks.
- Consider DeepSeek V3.2 for budget-constrained projects.
- Reserve GPT-5.2 Pro for complex reasoning tasks.
Topics
- AI Model Comparison
- Code Generation
- Large Language Model Performance
- Cost-effectiveness
- Game Development
Best for: AI Engineer, Machine Learning Engineer, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by KDnuggets.