Where Do Large Language Models Fail on Competitive Programming? A Taxonomy of Failures by Algorithm Type and Difficulty Rating

Source: cs.SE updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering) · Depth: Expert, extended

Summary

A systematic empirical study analyzed the failure patterns of Large Language Models (LLMs) in competitive programming, using 315 Codeforces problems spanning seven algorithm categories and three difficulty tiers. The researchers evaluated GPT-4o and Claude Sonnet 4.6 under strict execution-based conditions (temperature T=0.2), comparing direct zero-shot generation against Chain-of-Thought (CoT) prompting. CoT hurt GPT-4o, lowering its pass rate from 46.0% to 36.8% (a drop of 9.2 percentage points) and worsening a critical weakness in Greedy problems. Claude Sonnet 4.6, by contrast, maintained a 63.5% pass rate under CoT, but its Compile Errors rose by 244% (from 9 to 31), which is attributed to markdown-related instruction adherence issues. Wrong Answer (WA) was the dominant failure verdict for both models, accounting for over 90% of GPT-4o's and roughly 70% of Claude's unaccepted solutions, which points to a fundamental gap in algorithmic reasoning rather than in syntax.
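The headline figures mix relative changes with absolute pass rates, so it can help to recompute them. The sketch below is only a sanity check on the numbers quoted in the summary. The helper name `relative_increase` is illustrative, and the solved-problem counts are approximations inferred from the rounded percentages, assuming each configuration was scored on all 315 problems.

```python
def relative_increase(before: float, after: float) -> float:
    """Return the percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


TOTAL_PROBLEMS = 315  # problem count stated in the summary

# Claude Sonnet 4.6 Compile Errors: 9 (direct) -> 31 (CoT)
ce_increase = relative_increase(9, 31)

# GPT-4o pass rate: 46.0% (direct) -> 36.8% (CoT)
gpt4o_drop_points = 36.8 - 46.0                       # percentage points
gpt4o_drop_relative = relative_increase(46.0, 36.8)   # relative change

# Approximate solved counts (assumption: all 315 problems were scored)
gpt4o_direct_solved = round(0.460 * TOTAL_PROBLEMS)
gpt4o_cot_solved = round(0.368 * TOTAL_PROBLEMS)
claude_cot_solved = round(0.635 * TOTAL_PROBLEMS)

print(f"Compile Error increase: {ce_increase:.0f}%")
print(f"GPT-4o change: {gpt4o_drop_points:.1f} points "
      f"({gpt4o_drop_relative:.1f}% relative)")
print(f"Approx. solved: GPT-4o {gpt4o_direct_solved} -> {gpt4o_cot_solved}, "
      f"Claude (CoT) {claude_cot_solved}")
```

Note that the 244% figure describes a rise from a small base (9 errors), while the GPT-4o drop of 9.2 points corresponds to roughly a 20% relative decline in solved problems.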

Key takeaway

Machine Learning Engineers building LLM-powered coding assistants should recognize that standard Chain-of-Thought prompting can severely degrade performance on algorithmic tasks, particularly for models like GPT-4o. Because Wrong Answer verdicts dominate failures, prioritize strengthening foundational algorithmic reasoning over verbose prompting or simple syntax fixes. Consider evaluating models with a granular algorithm-by-difficulty taxonomy to identify specific blind spots and guide targeted interventions.

Key insights

Chain-of-Thought prompting can degrade LLM code generation performance and instruction adherence in competitive programming.
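One way the instruction adherence problem could surface in an evaluation harness is a response that mixes reasoning prose with markdown-fenced code, which fails to compile if submitted verbatim. The sketch below is a hypothetical post-processing step, not something the study describes: it assumes the failure takes that form, and the function name `extract_code` is illustrative. It keeps the last fenced block and falls back to the raw text when no fence is found.

```python
import re

# Matches a fenced block: three backticks, an optional language tag,
# the body, and three closing backticks.
_FENCE_RE = re.compile(r"`{3}[^\n]*\n(.*?)`{3}", re.DOTALL)


def extract_code(response: str) -> str:
    """Return the last fenced code block, or the whole response if none."""
    blocks = _FENCE_RE.findall(response)
    if blocks:
        # With CoT, the final block is most likely the finished solution.
        return blocks[-1].strip()
    return response.strip()
```

A step like this would only address Compile Errors caused by formatting. It does nothing for Wrong Answer verdicts, which make up the bulk of failures for both models.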

Principles

Method

Evaluate LLMs on a balanced taxonomy of competitive programming problems across algorithm categories and difficulty tiers. Compare direct generation against CoT prompting under strict execution-based conditions, analyzing failure modes by execution verdict.
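The method amounts to tallying execution verdicts per (algorithm category, difficulty tier) cell. A minimal sketch of that bookkeeping follows, under stated assumptions: the `Result` record, the verdict codes other than WA and CE, the tier labels, and the sample data are all placeholders, not taken from the study.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    category: str  # algorithm category, e.g. "Greedy"
    tier: str      # difficulty tier label (placeholder names below)
    verdict: str   # "AC" = accepted, "WA" = Wrong Answer, "CE" = Compile Error


def pass_rates(results):
    """Pass rate for each (category, tier) cell of the taxonomy."""
    totals, accepted = Counter(), Counter()
    for r in results:
        key = (r.category, r.tier)
        totals[key] += 1
        if r.verdict == "AC":
            accepted[key] += 1
    return {key: accepted[key] / totals[key] for key in totals}


def failure_shares(results):
    """Share of each verdict among unaccepted solutions."""
    failures = Counter(r.verdict for r in results if r.verdict != "AC")
    n = sum(failures.values())
    return {v: c / n for v, c in failures.items()} if n else {}


# Hypothetical sample data, for illustration only.
sample = [
    Result("Greedy", "easy", "AC"),
    Result("Greedy", "easy", "WA"),
    Result("Greedy", "hard", "WA"),
    Result("Greedy", "hard", "CE"),
]
print(pass_rates(sample))
print(failure_shares(sample))
```

Running the same tally once for direct generation and once for CoT would give the two grids to compare. If the 315 problems are split evenly across the 7 × 3 cells, each cell holds 15 problems, so per-cell pass rates are coarse and should be read with that sample size in mind.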

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, Prompt Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.