Where Do Large Language Models Fail on Competitive Programming? A Taxonomy of Failures by Algorithm Type and Difficulty Rating

2026-06-05 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

A systematic empirical study analyzed Large Language Model (LLM) failure patterns in competitive programming using 315 Codeforces problems spread across seven algorithm categories and three difficulty tiers. Evaluating GPT-4o and Claude Sonnet 4.6 under strict execution-based conditions at sampling temperature T=0.2, the researchers compared direct zero-shot generation against Chain-of-Thought (CoT) prompting. CoT hurt GPT-4o, dropping its pass rate from 46.0% to 36.8% (9.2 percentage points) and exacerbating a critical weakness in Greedy problems. Claude Sonnet 4.6, by contrast, maintained a 63.5% pass rate under CoT, but its Compile Errors rose 244% (from 9 to 31), attributed to lapses in following markdown formatting instructions. Wrong Answer (WA) was the dominant failure verdict for both models, accounting for over 90% of GPT-4o's and roughly 70% of Claude's unaccepted solutions, which points to a fundamental gap in algorithmic reasoning rather than in syntax or efficiency.
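
The headline figures can be checked with a few lines of arithmetic. This is a minimal sketch using only the numbers quoted in the summary; the helper name `relative_change` is an illustrative assumption and not something from the study.

```python
def relative_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, as a percentage."""
    return (after - before) / before * 100.0


# Claude Sonnet 4.6 Compile Error counts, direct vs. CoT (from the summary).
ce_increase = relative_change(9, 31)

# GPT-4o pass rates in percent, direct vs. CoT (from the summary).
gpt4o_point_drop = 46.0 - 36.8
gpt4o_relative_drop = -relative_change(46.0, 36.8)

print(f"Compile Error increase: {ce_increase:.0f}%")  # 244%
print(
    f"GPT-4o drop: {gpt4o_point_drop:.1f} points "
    f"({gpt4o_relative_drop:.0f}% relative)"
)  # 9.2 points (20% relative)
```

The 9.2-point drop is therefore a loss of about one fifth of GPT-4o's direct-generation pass rate, which is a more comparable figure when setting it beside the 244% relative rise in Compile Errors.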

Key takeaway

Machine Learning Engineers building LLM-powered coding assistants should recognize that standard Chain-of-Thought prompting can severely degrade performance on algorithmic tasks, particularly for models like GPT-4o. Because Wrong Answer verdicts dominate failures, prioritize improving the model's underlying algorithmic reasoning over verbose prompting or simple syntax fixes. Consider evaluating models with a granular algorithm-by-difficulty taxonomy to identify specific blind spots and guide targeted interventions.

Key insights

Chain-of-Thought prompting can degrade LLM code generation performance and instruction adherence in competitive programming.

Principles

LLM algorithmic reasoning is the primary bottleneck, not syntax or efficiency.
CoT can "context poison" models, leading to incorrect logic.
Performance degrades non-linearly with problem difficulty.

Method

Evaluate LLMs on a balanced taxonomy of competitive programming problems across algorithm categories and difficulty tiers. Compare direct generation against CoT prompting under strict execution-based conditions, analyzing failure modes by execution verdict.
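
The verdict analysis described above could be tallied roughly as follows. This is a sketch under stated assumptions, not the study's harness: the `Result` record, the verdict codes, the category and tier labels, and the toy data are all hypothetical and carry no real results.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    problem_id: str  # hypothetical identifier
    category: str    # algorithm category (labels here are made up)
    tier: str        # difficulty tier (labels here are made up)
    mode: str        # "direct" or "cot"
    verdict: str     # assumed codes: "AC", "WA", "CE", "TLE", "RE"


def pass_rate(results, mode):
    """Fraction of submissions in `mode` that were accepted."""
    in_mode = [r for r in results if r.mode == mode]
    if not in_mode:
        return 0.0
    return sum(r.verdict == "AC" for r in in_mode) / len(in_mode)


def failure_shares(results, mode):
    """Share of each non-accepted verdict among failures in `mode`."""
    failed = [r.verdict for r in results if r.mode == mode and r.verdict != "AC"]
    if not failed:
        return {}
    counts = Counter(failed)
    return {verdict: n / len(failed) for verdict, n in counts.items()}


# Toy data for illustration only.
results = [
    Result("p1", "greedy", "easy", "direct", "AC"),
    Result("p1", "greedy", "easy", "cot", "WA"),
    Result("p2", "dp", "hard", "direct", "WA"),
    Result("p2", "dp", "hard", "cot", "CE"),
    Result("p3", "graphs", "medium", "direct", "AC"),
    Result("p3", "graphs", "medium", "cot", "AC"),
]

for mode in ("direct", "cot"):
    print(mode, round(pass_rate(results, mode), 3), failure_shares(results, mode))
```

Keeping the prompting mode on each record lets the same problem set be compared across direct and CoT runs, so a shift in the verdict mix (for example, more Compile Errors under CoT) shows up separately from a change in the overall pass rate.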

In practice

Avoid zero-shot CoT for GPT-4o on competitive programming tasks.
Prioritize algorithmic reasoning improvements over syntax fixes.
Use a 2D taxonomy for granular LLM code evaluation.
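
One way to put the 2D taxonomy into practice is a pass-rate grid keyed by (algorithm category, difficulty tier), with low-scoring cells flagged for follow-up. The sketch below is illustrative only: the function names, the 0.5 threshold, the labels, and the records are assumptions, not values from the study.

```python
from collections import defaultdict


def pass_rate_grid(records):
    """Map each (category, tier) cell to its pass rate.

    `records` is an iterable of (category, tier, accepted) tuples.
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for category, tier, accepted in records:
        totals[(category, tier)] += 1
        passed[(category, tier)] += int(accepted)
    return {cell: passed[cell] / totals[cell] for cell in totals}


def weak_cells(grid, threshold):
    """Return cells whose pass rate falls below `threshold`, sorted."""
    return sorted(cell for cell, rate in grid.items() if rate < threshold)


# Toy records for illustration only.
records = [
    ("greedy", "easy", True),
    ("greedy", "easy", True),
    ("greedy", "hard", False),
    ("greedy", "hard", False),
    ("dp", "easy", True),
    ("dp", "easy", False),
    ("dp", "hard", False),
    ("dp", "hard", True),
]

grid = pass_rate_grid(records)
print(weak_cells(grid, threshold=0.5))  # [('greedy', 'hard')]
```

A single aggregate pass rate would hide the failing cell in this toy data; the grid makes it visible, which is the point of evaluating by category and tier together.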

Topics

Large Language Models
Competitive Programming
Code Generation
Failure Analysis
Chain-of-Thought Prompting
Algorithm Taxonomy

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, Prompt Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.