Temporal Stability and Few-Shot Prompting in Math Task Assessment

2026-05-28 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, AI in Education · Depth: Intermediate, quick

Summary

A longitudinal study investigated how stable AI tools are over time, and how much few-shot prompting helps, when classifying the cognitive demand of mathematics tasks with the Task Analysis Guide (TAG). Researchers tested a general-purpose tool, Gemini, and an education-specific tool, Coteach. Model version updates alone had mixed effects: Gemini's accuracy stayed at 58%, while Coteach's fell from 75% to 50%. Few-shot prompting with two exemplar tasks per category improved both tools, raising Gemini from 58% to 67% and restoring Coteach from 50% to its original 75%. These findings suggest that prompt engineering offers more reliable improvements than passive model updates, and that version updates may not consistently improve performance on specialized educational tasks.

Key takeaway

Educators and researchers selecting or implementing AI tools for specialized educational tasks should prioritize robust prompt engineering over relying solely on model version updates. Test and re-evaluate tool performance after every update, because passive improvements are not guaranteed and an update can even reduce accuracy. In this study, effort spent on crafting effective prompts produced more consistent gains than waiting for new model versions.

Key insights

Prompt engineering offers more reliable AI performance improvements on specialized tasks than passive model version updates.

Principles

AI model updates can degrade specialized task performance.
Few-shot prompting improved classification accuracy for both tools tested.
Prompt engineering can outweigh passive model improvements.

Method

The study tested Gemini and Coteach at baseline, after model version updates, and then with few-shot prompting (two exemplars per cognitive demand category) for math task classification.
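
The few-shot setup described above can be sketched as a prompt builder. This is a minimal illustration, not the study's actual prompt: the category names are the four levels commonly associated with the TAG and are assumed to match the study's labels, the instruction wording and the function name `build_few_shot_prompt` are hypothetical, and the exemplar tasks are placeholders to be replaced with real labeled tasks.

```python
from typing import Mapping, Sequence

# Assumption: the four cognitive demand levels commonly used with the TAG.
TAG_CATEGORIES = (
    "Memorization",
    "Procedures without connections",
    "Procedures with connections",
    "Doing mathematics",
)


def build_few_shot_prompt(
    exemplars: Mapping[str, Sequence[str]],
    task: str,
    per_category: int = 2,
) -> str:
    """Assemble a classification prompt with a fixed number of exemplars per category."""
    lines = [
        "Classify the cognitive demand of a mathematics task using the Task Analysis Guide.",
        "Answer with exactly one of: " + "; ".join(TAG_CATEGORIES) + ".",
        "",
    ]
    for category in TAG_CATEGORIES:
        examples = exemplars[category]
        if len(examples) != per_category:
            raise ValueError(f"expected {per_category} exemplars for {category!r}")
        for example in examples:
            # Each exemplar is shown as a solved case: task text, then its label.
            lines += [f"Task: {example}", f"Category: {category}", ""]
    # The task to classify comes last, with the label left open for the model.
    lines += [f"Task: {task}", "Category:"]
    return "\n".join(lines)


# Placeholder exemplars (hypothetical); substitute two real labeled tasks per category.
EXEMPLARS = {
    category: [f"<exemplar task {i} for {category}>" for i in (1, 2)]
    for category in TAG_CATEGORIES
}

prompt = build_few_shot_prompt(EXEMPLARS, "<new task to classify>")
print(prompt)
```

Keeping the exemplars in a plain mapping makes it easy to swap them out and re-run the same evaluation set, which is how the effect of the prompt can be separated from the effect of a model update.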

In practice

Implement few-shot prompting for AI task classification.
Routinely re-evaluate AI tools after version updates.
Prioritize prompt engineering over waiting for model updates.
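
Re-evaluating a tool after an update can be as simple as scoring it on a fixed labeled task set and comparing against the earlier score. The sketch below is illustrative only: the helper names `accuracy` and `flag_regression` and the 5-point tolerance are assumptions, and the figures in the usage section are the accuracies reported in the summary above.

```python
from typing import Sequence


def accuracy(predicted: Sequence[str], expected: Sequence[str]) -> float:
    """Fraction of tasks whose predicted category matches the expert label."""
    if not expected or len(predicted) != len(expected):
        raise ValueError("predicted and expected must be non-empty and equal in length")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


def flag_regression(before: float, after: float, tolerance: float = 0.05) -> bool:
    """Return True when accuracy dropped by more than the tolerance (an assumed threshold)."""
    return (before - after) > tolerance


# Accuracies reported in the study: (baseline, after version update).
reported = {"Gemini": (0.58, 0.58), "Coteach": (0.75, 0.50)}

for tool, (baseline, updated) in reported.items():
    status = "regressed" if flag_regression(baseline, updated) else "stable"
    print(f"{tool}: {baseline:.0%} -> {updated:.0%} ({status})")
# prints:
# Gemini: 58% -> 58% (stable)
# Coteach: 75% -> 50% (regressed)
```

Running the same check again after adding few-shot exemplars shows whether a prompt change recovers what an update lost.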

Topics

AI in Education
Prompt Engineering
Cognitive Demand Classification
Model Stability
Few-Shot Learning
Educational Technology

Best for: AI Engineer, Machine Learning Engineer, NLP Engineer, AI Scientist, Research Scientist, Consultant

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.