Where Do Large Language Models Fail on Competitive Programming? A Taxonomy of Failures by Algorithm Type and Difficulty Rating
Summary
A systematic empirical study analyzed how Large Language Models (LLMs) fail in competitive programming, using 315 Codeforces problems spread across seven algorithm categories and three difficulty tiers. The researchers evaluated GPT-4o and Claude Sonnet 4.6 under strict execution-based conditions (temperature T=0.2) and compared direct zero-shot generation against Chain-of-Thought (CoT) prompting. CoT hurt GPT-4o, dropping its pass rate from 46.0% to 36.8% and worsening an existing weakness in Greedy logic. Claude Sonnet 4.6, by contrast, held a 63.5% pass rate under CoT, but its Compile Errors rose 244% (from 9 to 31) because of instruction-adherence issues around markdown formatting. Wrong Answer (WA) was the dominant failure verdict for both models, accounting for over 90% of GPT-4o's and roughly 70% of Claude's unaccepted solutions, which points to a fundamental gap in algorithmic reasoning.
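The headline figures mix absolute and relative changes, so a quick arithmetic check helps when reading them. The sketch below uses only numbers quoted in the summary. The helper name `relative_increase` is illustrative, and the per-cell count assumes that a "balanced" taxonomy means an equal number of problems in every category-and-tier cell, which the summary does not state explicitly.

```python
def relative_increase(before: float, after: float) -> float:
    """Percentage change from `before` to `after`, relative to `before`."""
    return (after - before) / before * 100


# Claude Sonnet 4.6 Compile Errors: 9 (direct) -> 31 (CoT), as quoted above.
ce_increase = relative_increase(9, 31)
print(f"Compile Error increase: {ce_increase:.0f}%")

# GPT-4o pass rate: 46.0% (direct) -> 36.8% (CoT), as quoted above.
absolute_drop = 46.0 - 36.8                        # percentage points
relative_drop = -relative_increase(46.0, 36.8)     # percent of the baseline
print(f"GPT-4o drop: {absolute_drop:.1f} points ({relative_drop:.1f}% relative)")

# Assumption: equal problems per (category, tier) cell in a balanced design.
problems, categories, tiers = 315, 7, 3
per_cell = problems // (categories * tiers)
print(f"Problems per cell if perfectly balanced: {per_cell}")
```

Read this way, the 244% figure is a relative change in a small count (22 additional Compile Errors), while GPT-4o's drop is 9.2 percentage points, or about a fifth of its direct-generation pass rate.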
Key takeaway
Machine Learning Engineers building LLM-powered coding assistants should recognize that standard Chain-of-Thought prompting can severely degrade performance, particularly for models like GPT-4o on algorithmic tasks. Because Wrong Answer verdicts dominate failures, prioritize strengthening the model's underlying algorithmic reasoning over relying on verbose prompting or simple syntax fixes. Consider evaluating models with a granular algorithm-by-difficulty taxonomy to identify specific blind spots and guide targeted interventions.
Key insights
Chain-of-Thought prompting can degrade LLM code generation performance and instruction adherence in competitive programming.
Principles
- LLM algorithmic reasoning is the primary bottleneck, not syntax or efficiency.
- CoT can "context poison" models, leading to incorrect logic.
- Performance degrades non-linearly with problem difficulty.
Method
Evaluate LLMs on a balanced taxonomy of competitive programming problems across algorithm categories and difficulty tiers. Compare direct generation against CoT prompting under strict execution-based conditions, analyzing failure modes by execution verdict.
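As a rough illustration of the verdict analysis described above, the following sketch computes each failure verdict's share of all unaccepted submissions. It is not the study's harness: the function name `failure_shares`, the verdict codes other than WA, and the sample data are assumptions made for the example.

```python
from collections import Counter


def failure_shares(verdicts):
    """Return each non-accepted verdict's share of all unaccepted submissions."""
    failures = Counter(v for v in verdicts if v != "AC")
    total = sum(failures.values())
    if total == 0:
        return {}  # every submission was accepted
    return {verdict: count / total for verdict, count in failures.items()}


# Made-up verdicts for illustration only, not the study's data.
# AC = accepted, WA = wrong answer, CE = compile error, TLE = time limit exceeded.
sample = ["AC", "WA", "WA", "CE", "AC", "WA", "TLE"]
shares = failure_shares(sample)
for verdict, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{verdict}: {share:.0%}")
```

Normalizing by unaccepted submissions rather than by all submissions matches how the summary reports WA ("over 90% of GPT-4o's ... unaccepted solutions"), and keeps the shares comparable between models with different pass rates.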
In practice
- Avoid zero-shot CoT for GPT-4o on competitive programming tasks.
- Prioritize algorithmic reasoning improvements over syntax fixes.
- Use a 2D taxonomy (algorithm category by difficulty tier) for granular LLM code evaluation.
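A 2D taxonomy can be as simple as a pass-rate grid keyed by algorithm category and difficulty tier. The sketch below is one minimal way to build such a grid and surface the weakest cells. The function names, the tier labels, and every category other than Greedy are hypothetical, and the sample results are invented.

```python
from collections import defaultdict


def pass_rate_grid(results):
    """Map each (category, tier) cell to its pass rate.

    `results` is an iterable of (category, tier, passed) tuples.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, tier, passed in results:
        totals[(category, tier)] += 1
        passes[(category, tier)] += int(passed)
    return {cell: passes[cell] / totals[cell] for cell in totals}


def weakest_cells(grid, k=3):
    """Return the k cells with the lowest pass rate."""
    return sorted(grid, key=grid.get)[:k]


# Invented results for illustration only.
sample = [
    ("greedy", "easy", True),
    ("greedy", "easy", False),
    ("greedy", "hard", False),
    ("dp", "easy", True),
]
grid = pass_rate_grid(sample)
print(weakest_cells(grid, k=1))
```

Reporting per cell rather than one aggregate pass rate is what lets a weakness confined to a single category or tier show up at all.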
Topics
- Large Language Models
- Competitive Programming
- Code Generation
- Failure Analysis
- Chain-of-Thought Prompting
- Algorithm Taxonomy
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, Prompt Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.