NextMotionQA: Benchmarking and Judging Human Motion Understanding with Vision-Language Models
Summary
NextMotionQA is a benchmark for evaluating human motion understanding in vision-language models (VLMs). It addresses two limitations of existing benchmarks: coarse granularity and answer ambiguity. The benchmark comprises three tasks (multiple-choice question answering, video captioning, and fine-grained error correction), organized along three semantic axes and stratified into three complexity levels. An evaluation of twelve representative VLMs on NextMotionQA revealed critical capability gaps, particularly in fine-grained analysis, that conventional single-task evaluations miss. In addition, when VLMs are used as judges of text-to-motion output, they align strongly with expert ratings on coarse criteria (Cohen's κ=0.70), but agreement drops sharply on fine-grained, part-level judgment (κ=0.10).
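To make the agreement figures concrete, here is a minimal sketch of how Cohen's κ can be computed between an expert's labels and a VLM judge's labels. It assumes unweighted κ over categorical labels; the summary does not say which variant the paper uses, and the labels below are invented for illustration, not taken from the benchmark.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters need the same non-zero number of labels")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: probability both raters pick the same label independently,
    # estimated from each rater's own label frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical ratings (illustrative only).
expert = ["good", "good", "bad", "bad", "good", "bad", "good", "bad", "good", "bad"]
vlm = ["good", "good", "bad", "bad", "good", "bad", "good", "bad", "bad", "good"]
kappa = cohens_kappa(expert, vlm)  # roughly 0.6 for this toy data
```

κ corrects raw agreement for the agreement expected by chance, so a value near 0.10 means the judge barely outperforms chance-level agreement with the experts even when raw agreement looks moderate.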
Key takeaway
If you develop vision-language models for human motion understanding, recognize that existing benchmarks are insufficient for diagnosing fine-grained capabilities. Use NextMotionQA for a comprehensive evaluation of your models, especially to uncover weaknesses in detailed motion analysis. Be cautious when using VLMs as judges for fine-grained motion tasks: their agreement with expert ratings drops from κ=0.70 on coarse criteria to κ=0.10 at the part level.
Key insights
NextMotionQA is a new benchmark for human motion understanding, revealing VLM limitations in fine-grained analysis.
Principles
- Existing benchmarks lack granularity and diagnostic power.
- VLMs agree well with experts on coarse motion judgment but poorly on fine-grained, part-level judgment.
Method
NextMotionQA is built through a semi-automated process: VLMs assist in creating the dataset, and experts verify the results. The benchmark spans three tasks and three complexity levels for evaluating human motion understanding.
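One way to see why stratification matters is to break scores down per task and complexity level instead of reporting a single aggregate. The sketch below is illustrative only: the record fields (`task`, `level`, `score`), the task labels, and the scores are hypothetical and do not reflect the benchmark's actual data format or results.

```python
from collections import defaultdict


def stratified_means(records):
    """Mean score for each (task, level) cell of the evaluation grid."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for record in records:
        key = (record["task"], record["level"])
        sums[key] += record["score"]
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Hypothetical per-item results; scores are in [0, 1].
records = [
    {"task": "multiple_choice", "level": 1, "score": 1.0},
    {"task": "multiple_choice", "level": 1, "score": 1.0},
    {"task": "multiple_choice", "level": 3, "score": 0.0},
    {"task": "error_correction", "level": 3, "score": 0.0},
    {"task": "error_correction", "level": 3, "score": 1.0},
]

overall = sum(r["score"] for r in records) / len(records)
by_cell = stratified_means(records)

for (task, level), mean in sorted(by_cell.items()):
    print(f"{task:18s} level {level}: {mean:.2f}")
```

In this toy data the aggregate score hides that every level-3 multiple-choice item failed, which is the kind of gap a single-number evaluation can miss.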
In practice
- Diagnose VLM weaknesses in embodied AI and robotics.
- Improve VLM evaluation for fine-grained motion tasks.
Topics
- NextMotionQA
- Vision-Language Models
- Human Motion Understanding
- Benchmarking
- Embodied AI
- Video Captioning
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.