GeoDial: A Multimodal Conversational Tutoring Dataset for Geometry Problem-Solving with Visual Tutor Turns
Summary
GeoDial is a new multimodal tutoring dataset comprising over 1.3K teacher-student dialogs focused on geometry problem-solving, collected from experienced math teachers. Unlike most existing datasets, GeoDial explicitly grounds instructional turns in diagram highlights, addressing the limitation of text-only AI tutors. The dataset was created using a scalable annotation protocol integrating dialog acts, visual highlighting, and feedback. Experiments fine-tuning Vision-Language Models (VLMs) like Qwen3-VL-32B on GeoDial show substantial improvements in generating tutoring utterances, but models struggle to produce accurate diagram highlights. This challenge stems from highlight sparsity and the tight coupling between visual actions and teacher utterances, indicating a key area for future research in visually grounded pedagogical modeling.
Key takeaway
AI scientists and machine learning engineers building educational AI should prioritize multimodal datasets like GeoDial to advance visually grounded tutoring. Current VLMs show promise in generating pedagogical language but struggle with precise diagram highlighting. Research should therefore focus on integrating visual reasoning with instructional strategies, for example through separate highlight prediction models or weighted training that compensates for highlight sparsity.
Key insights
Multimodal tutoring datasets with visual grounding are crucial for developing effective AI tutors.
Principles
- Visual information plays a critical role in learning across subjects.
- AI tutors require multimodal datasets to mimic human teaching effectively.
Method
The GeoDial annotation protocol integrates dialog acts, visual highlighting, and feedback, using VLM-simulated students and VLM-suggested teacher utterances to scale data collection.
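To make the three annotation layers concrete, here is a minimal sketch of how a single dialog turn might be represented. The summary does not describe GeoDial's actual schema, so every field name, dialog-act label, and diagram element id below is a hypothetical illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Turn:
    """One dialog turn. All field names are hypothetical, not GeoDial's schema."""
    speaker: str                       # "teacher" or "student"
    utterance: str
    dialog_act: Optional[str] = None   # assumed label set, e.g. "hint", "probe"
    highlights: List[str] = field(default_factory=list)  # assumed diagram element ids
    feedback: Optional[str] = None     # assumed free-text annotator feedback


def teacher_highlight_rate(turns: List[Turn]) -> float:
    """Fraction of teacher turns that carry at least one diagram highlight."""
    teacher_turns = [t for t in turns if t.speaker == "teacher"]
    if not teacher_turns:
        return 0.0
    return sum(1 for t in teacher_turns if t.highlights) / len(teacher_turns)


# Invented example dialog, for illustration only.
dialog = [
    Turn("student", "I don't see why angle ABC is 90 degrees."),
    Turn("teacher", "Look at segment AC. What kind of segment is it in this circle?",
         dialog_act="hint", highlights=["segment_AC"]),
    Turn("student", "It's a diameter."),
    Turn("teacher", "Right, so what does that tell you about the angle at B?",
         dialog_act="probe"),
]
rate = teacher_highlight_rate(dialog)
```

Keeping highlights as a list attached to the teacher turn reflects the coupling between visual actions and utterances noted in the summary, and a statistic like `teacher_highlight_rate` is one simple way to quantify how sparse highlights are in a collection.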
In practice
- Fine-tune VLMs on multimodal datasets to improve tutor utterance generation.
- Employ weighted training for highlight generation to address data imbalance.
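One way the weighted training suggestion could look in practice is a binary cross-entropy loss that up-weights the rare positive class. This is a minimal sketch, assuming highlight prediction is framed as one binary label per candidate diagram element. The summary does not state how the paper formulates highlights or which weighting scheme it considers, so the framing, the inverse-frequency weight, and the function names are all assumptions.

```python
import math


def positive_weight(labels):
    """Negatives-to-positives ratio, a common heuristic weight for sparse positives."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0:
        return 1.0  # no positives to up-weight
    return negatives / positives


def weighted_bce(probs, labels, pos_weight, eps=1e-7):
    """Mean binary cross-entropy with the positive term scaled by pos_weight."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


# Hypothetical labels: 1 means the element is highlighted, 0 means it is not.
labels = [0, 0, 0, 0, 0, 0, 0, 1]
# A model that predicts "no highlight" almost everywhere.
probs = [0.05] * len(labels)

w = positive_weight(labels)
plain_loss = weighted_bce(probs, labels, pos_weight=1.0)
weighted_loss = weighted_bce(probs, labels, pos_weight=w)
```

With the unweighted loss, a model that never predicts a highlight is penalized only lightly because positives are rare. Scaling the positive term makes the missed highlight dominate the loss, which is the intended effect of weighting under sparsity.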
Topics
- Multimodal Tutoring
- Geometry Problem-Solving
- Vision-Language Models
- Educational Datasets
- Diagram Highlighting
- AI Tutors
Best for: AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.