PyMETA: A Benchmark Dataset for Hierarchical Student Code Error Classification with Python-Interpreter-Based Labels

Source: cs.SE updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics, Software Development & Engineering) · Depth: Expert, extended

Summary

PyMETA is a new, large-scale Python code error classification dataset comprising 48,646 student submissions across 155 problems from 579 users. It features a three-level hierarchical taxonomy, categorizing errors from binary (Error/No Error) to 14 fine-grained types based on Python's official exception hierarchy. A diagnostic subset of 97 expert-annotated samples supports multi-error analysis. Evaluations on PyMETA show that finetuned smaller models, like CodeLlama-7B (80.6% macro F1 for single-error), outperform prompted LLMs, with Gemini 2.5 Pro achieving the best LLM performance at 71.9% macro F1. A significant finding is the consistent over-prediction of "Logic Error" by LLMs, with GPT-3.5 showing a 92.8% Logic Error Overprediction Rate, while Gemini 2.5 Pro had the lowest at 17.6%. The dataset and findings provide a foundation for LLM-based code error research and educational tools.
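The two headline metrics can be made concrete with a small sketch. Macro F1 is the standard unweighted mean of per-class F1 scores. The summary does not define the Logic Error Overprediction Rate, so the version below is an assumption: the share of samples whose true label is not "Logic Error" but which the model labels as "Logic Error". The function names and the toy labels are hypothetical and are not taken from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class with no support and no predictions scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def logic_error_overprediction_rate(y_true, y_pred, logic_label="Logic Error"):
    """ASSUMED definition (not confirmed by the source): among samples whose
    true label is not the logic-error class, the fraction predicted as it."""
    non_logic = [(t, p) for t, p in zip(y_true, y_pred) if t != logic_label]
    if not non_logic:
        return 0.0
    return sum(p == logic_label for _, p in non_logic) / len(non_logic)


# Toy labels for illustration only.
y_true = ["SyntaxError", "TypeError", "Logic Error", "NameError"]
y_pred = ["Logic Error", "TypeError", "Logic Error", "Logic Error"]
print(macro_f1(y_true, y_pred))
print(logic_error_overprediction_rate(y_true, y_pred))
```

Because macro F1 weights every class equally, a model that collapses rare exception types into "Logic Error" is penalized on each of those classes, which is why the two metrics tend to move together.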

Key takeaway

If you are an AI scientist or machine learning engineer developing code diagnosis tools, prioritize finetuning smaller models such as CodeLlama-7B over relying solely on prompting larger LLMs for Python error classification. Current LLMs show a strong bias toward over-predicting "Logic Error", so your models will need specific strategies to mitigate it, especially for fine-grained error detection. Consider using PyMETA to benchmark and refine your models' ability to handle multi-error scenarios and reduce diagnostic uncertainty.

Key insights

PyMETA offers a large Python code error dataset with a hierarchical taxonomy for LLM evaluation; evaluations on it reveal prediction biases and performance gaps between finetuned models and prompted LLMs.

Method

PyMETA is built in three steps: collecting 48,646 student submissions, deriving single-error labels from IDE execution, and selecting a 97-sample diagnostic subset through confusion-matrix and entropy-based sampling, which experts then annotate for multiple concurrent errors.
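A minimal sketch of two ideas in this pipeline follows: labeling a submission by the exception the Python interpreter raises, and ranking samples by prediction entropy to pick uncertain ones for expert review. Both functions are illustrative assumptions, not the authors' implementation. The names `interpreter_label`, `prediction_entropy`, and `most_uncertain` are hypothetical, the confusion-matrix part of the sampling is not shown, and the summary does not say how the paper handles timeouts, sandboxing, or logic errors that raise no exception.

```python
import math


def interpreter_label(source: str) -> tuple[str, str]:
    """Return (binary label, fine-grained label) for one submission.

    Assumption: the fine-grained label is simply the raised exception's class
    name. Submissions that run cleanly but produce wrong output would need
    test cases to detect and are not covered here.
    WARNING: exec runs untrusted code in-process; use a sandbox in practice.
    """
    try:
        code = compile(source, "<submission>", "exec")  # SyntaxError surfaces here
        exec(code, {"__name__": "__submission__"})      # runtime errors surface here
    except Exception as exc:
        return "Error", type(exc).__name__
    return "No Error", "None"


def prediction_entropy(probs):
    """Shannon entropy in bits of one predicted class distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def most_uncertain(candidates, k):
    """Pick the k sample ids with the highest prediction entropy.

    `candidates` maps a sample id to a list of class probabilities.
    """
    ranked = sorted(
        candidates,
        key=lambda sid: prediction_entropy(candidates[sid]),
        reverse=True,
    )
    return ranked[:k]


print(interpreter_label("print(1 / 0)"))
print(most_uncertain({"a": [1.0, 0.0], "b": [0.5, 0.5], "c": [0.9, 0.1]}, 2))
```

Ranking by entropy favors samples where a classifier spreads probability across several error types, which are plausible candidates for containing more than one error.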

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.