PyMETA: A Benchmark Dataset for Hierarchical Student Code Error Classification with Python-Interpreter-Based Labels

2025-12-30 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Software Development & Engineering · Depth: Expert, extended

Summary

PyMETA is a new, large-scale dataset for Python code error classification, comprising 48,646 student submissions to 155 problems from 579 users. Its three-level hierarchical taxonomy ranges from a binary label (Error/No Error) to 14 fine-grained error types based on Python's official exception hierarchy. A diagnostic subset of 97 expert-annotated samples supports multi-error analysis. Evaluations on PyMETA show that finetuned smaller models such as CodeLlama-7B (80.6% macro F1 on single-error classification) outperform prompted LLMs, among which Gemini 2.5 Pro performs best at 71.9% macro F1. A significant finding is that LLMs consistently over-predict "Logic Error": GPT-3.5 shows a 92.8% Logic Error Overprediction Rate, while Gemini 2.5 Pro has the lowest at 17.6%. The dataset and findings provide a foundation for LLM-based code error research and educational tools.
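
The results above are reported as macro F1, which averages per-class F1 scores without weighting by class frequency, so rare error types count as much as common ones. The following is a minimal sketch of that metric in plain Python. The function name and the toy labels are illustrative assumptions, not taken from the PyMETA code.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example (hypothetical labels): one SyntaxError is mislabeled as Logic Error.
y_true = ["SyntaxError", "TypeError", "Logic Error", "Logic Error"]
y_pred = ["Logic Error", "TypeError", "Logic Error", "Logic Error"]
score = macro_f1(y_true, y_pred)
```

In this toy case accuracy is 75%, but the macro F1 is lower because the missed SyntaxError class contributes an F1 of zero.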

Key takeaway

If you are an AI scientist or machine learning engineer developing code diagnosis tools, prioritize finetuning smaller models like CodeLlama-7B over relying solely on prompting larger LLMs for Python error classification. Be aware that current LLMs exhibit a strong bias towards over-predicting "Logic Error"; your models will need specific strategies to mitigate this, especially for fine-grained error detection. Consider using PyMETA to benchmark and refine your models' ability to handle multi-error scenarios and reduce diagnostic uncertainty.

Key insights

PyMETA offers a large Python code error dataset with a hierarchical taxonomy for evaluating LLMs, and it reveals systematic biases and performance gaps.

Principles

Finetuned small models outperform prompted LLMs.
LLMs exhibit strong "Logic Error" overprediction bias.
Higher prediction entropy signals lower reliability.
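
The entropy principle can be made concrete with a small sketch. It assumes, hypothetically, that a model is queried several times on the same submission and that the spread of its answers is summarized as Shannon entropy. The summary does not say how PyMETA obtains prediction distributions, so the sampling setup and all names here are illustrative.

```python
import math
from collections import Counter


def empirical_distribution(sampled_labels):
    """Turn repeated predictions for one submission into label probabilities."""
    counts = Counter(sampled_labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical repeated predictions for a single student submission.
samples = ["Logic Error", "Logic Error", "TypeError", "IndexError"]
dist = empirical_distribution(samples)
uncertainty = entropy_bits(dist.values())
```

A model that always returns the same label has zero entropy, while answers spread evenly over four labels reach 2 bits. Under this reading, high-entropy samples are the ones worth routing to a human reviewer.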

Method

PyMETA was built by collecting 48,646 student submissions, deriving single-error labels from IDE execution, and having experts annotate a 97-sample diagnostic subset for multiple concurrent errors. The diagnostic subset was selected using confusion-matrix and entropy-based sampling.
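
To illustrate how execution-based labels can map onto a three-level taxonomy, here is a minimal sketch that runs a submission, catches the raised exception, and reads its parent class from Python's built-in exception hierarchy. Several parts are assumptions rather than PyMETA's actual pipeline: the function name, using the parent class as the middle level, and treating "runs cleanly but prints the wrong output" as a Logic Error. A real pipeline would also sandbox untrusted code, for example in a subprocess with a timeout.

```python
import contextlib
import io


def label_submission(source: str, expected_stdout: str) -> dict:
    """Return a hypothetical three-level label for one submission."""
    captured = io.StringIO()
    try:
        # compile() surfaces SyntaxError/IndentationError before anything runs.
        code = compile(source, "<submission>", "exec")
        with contextlib.redirect_stdout(captured):
            exec(code, {"__name__": "__main__"})  # unsafe for untrusted code
    except Exception as exc:
        fine = type(exc).__name__
        # Assumed middle level: direct parent in the built-in exception hierarchy.
        coarse = type(exc).__mro__[1].__name__
        return {"level1": "Error", "level2": coarse, "level3": fine}
    if captured.getvalue().strip() != expected_stdout.strip():
        # Assumed rule: no exception but wrong output counts as a Logic Error.
        return {"level1": "Error", "level2": "Logic Error", "level3": "Logic Error"}
    return {"level1": "No Error", "level2": "No Error", "level3": "No Error"}


crash = label_submission("print(1 / 0)", "")
wrong = label_submission("print(1 + 1)", "3")
clean = label_submission("print(1 + 1)", "2")
```

Because the interpreter reports only the first exception it hits, this kind of labeling yields one error per submission, which is consistent with the need for a separate expert-annotated subset for concurrent errors.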

In practice

Use PyMETA to benchmark code diagnosis LLMs.
Prioritize finetuning smaller models over prompting large ones.
Address LLM "Logic Error" overprediction in training.
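
Before addressing the bias it helps to measure it. The sketch below computes one plausible reading of a Logic Error overprediction rate: among samples whose gold label is not "Logic Error", the fraction that the model labels "Logic Error" anyway. This definition is an assumption for illustration; the paper's exact formula is not given in this summary and may differ.

```python
def logic_error_overprediction_rate(y_true, y_pred, target="Logic Error"):
    """Share of non-target gold samples that were predicted as the target.

    Assumed definition for illustration only; not taken from the PyMETA paper.
    """
    non_target = [(t, p) for t, p in zip(y_true, y_pred) if t != target]
    if not non_target:
        return 0.0
    overpredicted = sum(1 for _, p in non_target if p == target)
    return overpredicted / len(non_target)


# Hypothetical gold labels and model predictions.
gold = ["SyntaxError", "TypeError", "Logic Error", "No Error"]
pred = ["Logic Error", "TypeError", "Logic Error", "Logic Error"]
rate = logic_error_overprediction_rate(gold, pred)
```

Tracking a rate like this per model, alongside macro F1, shows whether a gain in score comes from better discrimination or from simply predicting the dominant label less often.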

Topics

PyMETA Dataset
Code Error Classification
Large Language Models
Python Programming Education
Model Benchmarking
Logic Error Overprediction

Code references

Circle-Cat/pymeta

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.