Multi-Task Tennis Stroke Biomechanics Analysis Using MediaPipe Pose

· Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision & Pattern Recognition · Depth: Expert, quick

Summary

A multi-task pipeline analyzes tennis stroke biomechanics from plain RGB video, using the 33 landmarks that MediaPipe Pose Landmarker reports in metric world coordinates. The system detects strokes automatically with a weighted joint velocity score (s(t) = 0.5 v_wrist + 0.3 m_elbow + 0.2 m_shoulder), then performs stroke recognition, shot direction prediction, and posture quality grading, complemented by a rule-based feedback layer. The core model is TennisTransformerGPU, a 564,103-parameter transformer (4 layers, 4 heads, d=128) with three output heads that processes sequences of 30 frames by 39 features. Trained on 1,281 strokes from 7 professionals and 1 amateur, it achieved 83.7% accuracy on stroke type, 61.9% on direction, and 62.6% on posture. In a cross-player evaluation, stroke-type accuracy remained high at 82.9%, but direction prediction failed to transfer. An ablation study found that world coordinates are vital: substituting image-space landmarks significantly reduced accuracy. The system is fully reproducible on Kaggle's free T4 GPU tier.
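
The stroke-detection score can be illustrated with a minimal sketch. It assumes that all three terms are per-frame joint speeds computed from world coordinates (the summary does not define the v and m symbols), that the right arm is the hitting arm, and that peaks are picked by a simple greedy suppression. The function names, the threshold, and the minimum gap are hypothetical and are not taken from the original work.

```python
import numpy as np

# MediaPipe Pose landmark indices for the right arm (left arm: 11, 13, 15).
R_SHOULDER, R_ELBOW, R_WRIST = 12, 14, 16


def joint_speed(world, joint, fps):
    """Speed of one joint per frame; world has shape (T, 33, 3) in metres."""
    step = np.diff(world[:, joint, :], axis=0)   # frame-to-frame displacement
    speed = np.linalg.norm(step, axis=1) * fps   # metres per second
    return np.concatenate([[0.0], speed])        # pad so the output has length T


def stroke_score(world, fps=30.0):
    """Weighted score s(t) with the 0.5 / 0.3 / 0.2 weights from the summary.

    Assumption: every term is a joint speed; the paper may define them differently.
    """
    return (0.5 * joint_speed(world, R_WRIST, fps)
            + 0.3 * joint_speed(world, R_ELBOW, fps)
            + 0.2 * joint_speed(world, R_SHOULDER, fps))


def find_stroke_peaks(score, threshold, min_gap=30):
    """Greedy suppression: take frames in descending score order, keeping
    those above threshold that are at least min_gap frames from kept ones."""
    peaks = []
    for t in np.argsort(score)[::-1]:
        if score[t] < threshold:
            break
        if all(abs(int(t) - p) >= min_gap for p in peaks):
            peaks.append(int(t))
    return sorted(peaks)
```

Each returned frame index would then serve as the centre of a candidate stroke window.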

Key takeaway

For sports biomechanics researchers developing automated coaching tools, this work highlights the importance of using the metric world coordinates that pose estimators such as MediaPipe Pose provide. In the reported ablation, image-space landmarks significantly reduced accuracy. Prioritize metric 3D pose data and consider compact transformer architectures for multi-task analysis, and validate each task across players: stroke-type recognition transferred to unseen players here, but direction prediction did not.
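
As a rough illustration of the distinction, the MediaPipe Tasks Pose Landmarker result exposes both `pose_landmarks` (normalized image-space coordinates) and `pose_world_landmarks` (metres, with the origin near the hips). The helper below is a hypothetical sketch that reads the world-coordinate field into an array. It is exercised here with a stand-in object rather than a real detection result, so it runs without the MediaPipe package or a model file.

```python
from types import SimpleNamespace

import numpy as np


def world_landmarks_to_array(result, person=0):
    """Convert a PoseLandmarkerResult-like object to a (33, 3) float array.

    Reads result.pose_world_landmarks (metric x, y, z), not the normalized
    image-space result.pose_landmarks. Returns None when nobody was detected.
    """
    if not result.pose_world_landmarks:
        return None
    landmarks = result.pose_world_landmarks[person]
    return np.array([[lm.x, lm.y, lm.z] for lm in landmarks], dtype=np.float32)


# Stand-in for a real detection result (assumption: same attribute names).
fake_result = SimpleNamespace(
    pose_world_landmarks=[[SimpleNamespace(x=0.01 * i, y=0.0, z=-0.2)
                           for i in range(33)]]
)
frame = world_landmarks_to_array(fake_result)
print(frame.shape)
```

Stacking one such array per video frame gives the (T, 33, 3) sequence from which per-stroke windows can be cut.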

Key insights

A multi-task transformer pipeline analyzes tennis biomechanics from RGB video, leveraging MediaPipe's world coordinates for robust stroke recognition and posture grading.

Method

The pipeline automatically finds strokes via a weighted joint velocity score, then feeds 30-frame, 39-feature sequences from MediaPipe Pose (world coordinates) into a 564,103-parameter TennisTransformerGPU with three parallel output heads.
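
The model shape described above can be sketched in PyTorch. This is not the authors' TennisTransformerGPU: the class name, the feed-forward width, the learned positional embedding, the mean pooling, and the number of classes per head are all assumptions, so the parameter count will not match the reported 564,103.

```python
import torch
import torch.nn as nn


class MultiTaskPoseTransformer(nn.Module):
    """Hypothetical encoder with three classification heads."""

    def __init__(self, n_features=39, seq_len=30, d_model=128, n_heads=4,
                 n_layers=4, n_strokes=4, n_directions=3, n_postures=3):
        super().__init__()
        self.input_proj = nn.Linear(n_features, d_model)
        # Learned positional embedding (assumption; the paper may differ).
        self.pos_embed = nn.Parameter(torch.zeros(1, seq_len, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads,
            dim_feedforward=256,  # assumed width
            batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Three parallel heads that share the pooled sequence representation.
        self.stroke_head = nn.Linear(d_model, n_strokes)
        self.direction_head = nn.Linear(d_model, n_directions)
        self.posture_head = nn.Linear(d_model, n_postures)

    def forward(self, x):
        # x: (batch, 30 frames, 39 features)
        h = self.input_proj(x) + self.pos_embed
        h = self.encoder(h).mean(dim=1)  # average over the time axis
        return {
            "stroke": self.stroke_head(h),
            "direction": self.direction_head(h),
            "posture": self.posture_head(h),
        }


model = MultiTaskPoseTransformer().eval()
with torch.no_grad():
    logits = model(torch.randn(2, 30, 39))
print({name: tuple(out.shape) for name, out in logits.items()})
```

Sharing one encoder across the heads lets the three tasks be trained jointly, for example by summing a cross-entropy loss per head.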

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.