CalTennis: Large Multi-View Tennis Video Dataset and Benchmark of Monocular-to-3D Pose Estimation

2026-06-18 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

The Caltech Tennis Dataset (CalTennis) is a large-scale video benchmark for monocular-to-3D pose estimation in diverse environments. It comprises over 11 million frames (51 hours) from 40 players, captured by 2-6 synchronized cameras at 60 Hz, making it 10 times larger than existing in-the-wild human motion datasets and 3 times larger than MOCAP-ground-truthed alternatives. It uniquely provides synchronized multi-view recordings of expert athletic motion, enabling label-free evaluation. The dataset is collected with a simple, standardized protocol that includes automated video calibration and synchronization. Benchmarking state-of-the-art methods on CalTennis shows that 3D joint angle recovery is accurate, but models consistently struggle with depth and foot contact estimation. The authors propose novel footwork and stability metrics to expose these failure modes and guide future improvements.
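The headline figures can be checked against each other with a line of arithmetic: 51 hours at 60 Hz works out to just over 11 million frames. The sketch below assumes the 51 hours and the frame count refer to the same footage total; the summary does not say whether that total is per camera or summed across views.

```python
# Sanity check of the summary's figures: 51 hours of video at 60 Hz.
HOURS = 51
FPS = 60

def total_frames(hours: float, fps: int) -> int:
    """Number of frames in `hours` of video recorded at `fps` frames per second."""
    return int(hours * 3600 * fps)

frames = total_frames(HOURS, FPS)
print(frames)  # 11016000, i.e. "over 11 million frames"
```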

Key takeaway

For Computer Vision Engineers developing 3D pose estimation models, CalTennis offers a large multi-view benchmark for evaluating performance on expert athletic motion. Your models likely achieve accurate 3D joint angles, but you should prioritize improving depth and foot contact estimation, as these remain significant weaknesses. Use the proposed footwork and stability metrics to identify specific failure modes and guide your development efforts toward more robust athletic motion analysis.

Key insights

CalTennis provides a large-scale, multi-view benchmark for 3D pose estimation, highlighting depth and foot contact as critical challenges.

Principles

Multi-view video enables label-free 3D pose evaluation.
Simple protocols can yield large-scale, high-quality datasets.
Current 3D pose models struggle with depth and foot contact.
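One common way multi-view footage can stand in for labels is a reprojection check: lift a pose from one camera, project it into a second calibrated camera, and compare against 2D joint detections in that second view. The summary does not state how the authors implement label-free evaluation, so the following is only a minimal sketch under assumptions: a pinhole camera without lens distortion, and hypothetical names (`project`, `mean_reprojection_error`, `R`, `t`, `fx`, `fy`, `cx`, `cy`).

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Vec2 = Tuple[float, float]

def project(p: Vec3, R: Sequence[Sequence[float]], t: Vec3,
            fx: float, fy: float, cx: float, cy: float) -> Vec2:
    """Project a world-space point into pixel coordinates (pinhole model, no distortion)."""
    # World -> camera coordinates: X_cam = R @ p + t
    xc = sum(R[0][i] * p[i] for i in range(3)) + t[0]
    yc = sum(R[1][i] * p[i] for i in range(3)) + t[1]
    zc = sum(R[2][i] * p[i] for i in range(3)) + t[2]
    if zc <= 0:
        raise ValueError("point is behind the camera")
    # Perspective divide, then apply focal lengths and principal point.
    return (fx * xc / zc + cx, fy * yc / zc + cy)

def mean_reprojection_error(joints_3d: List[Vec3], joints_2d: List[Vec2],
                            R: Sequence[Sequence[float]], t: Vec3,
                            fx: float, fy: float, cx: float, cy: float) -> float:
    """Mean pixel distance between projected 3D joints and 2D detections in a held-out view."""
    errors = [math.dist(project(p, R, t, fx, fy, cx, cy), q)
              for p, q in zip(joints_3d, joints_2d)]
    return sum(errors) / len(errors)

# Toy example: identity extrinsics, two joints 5 m in front of the camera.
R_id = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pred_3d = [(0.0, 0.0, 5.0), (1.0, 0.0, 5.0)]
det_2d = [(640.0, 360.0), (843.0, 364.0)]
err = mean_reprojection_error(pred_3d, det_2d, R_id, (0.0, 0.0, 0.0),
                              1000.0, 1000.0, 640.0, 360.0)
print(err)  # 2.5 pixels
```

A depth error in the source view shows up as a lateral shift in any camera that looks from a different angle, which is why a second view can expose depth mistakes that are invisible from the original viewpoint.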

Method

A simple, standardized protocol enables data collection without specialized equipment or expertise, featuring fully automated video calibration and synchronization.
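The summary does not describe how the automated synchronization works. As an illustration of one generic approach (not the authors' method), the sketch below estimates an integer frame offset between two cameras by cross-correlating a per-frame 1D signal from each, such as total motion energy. The function name `best_lag` and the toy signals are hypothetical.

```python
from typing import List

def best_lag(a: List[float], b: List[float], max_lag: int) -> int:
    """Return the lag L (in frames) that maximizes the mean of a[i] * b[i + L]
    over overlapping samples, after removing each signal's mean.
    A positive result means b is delayed relative to a."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    a0 = [x - mean_a for x in a]
    b0 = [x - mean_b for x in b]
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        pairs = [(a0[i], b0[i + lag]) for i in range(len(a0))
                 if 0 <= i + lag < len(b0)]
        if not pairs:
            continue
        # Normalize by overlap length so short overlaps are not penalized.
        score = sum(x * y for x, y in pairs) / len(pairs)
        if score > best_score:
            best, best_score = lag, score
    return best

# Toy signals: camera B sees the same motion burst two frames later.
cam_a = [0, 0, 1, 3, 1, 0, 0, 0, 0, 0]
cam_b = [0, 0, 0, 0, 1, 3, 1, 0, 0, 0]
lag = best_lag(cam_a, cam_b, max_lag=4)
print(lag)  # 2 frames, about 33 ms at 60 Hz
```

An integer-frame search like this resolves offsets only to 1/60 s at the dataset's frame rate; a production pipeline would likely need sub-frame refinement.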

In practice

Benchmark monocular-to-3D pose algorithms on CalTennis.
Focus model improvements on depth and foot contact.
Apply footwork and stability metrics for action analysis.
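The summary names footwork and stability metrics without defining them. To make the foot contact failure mode concrete, here is a generic "foot skate" measure often used in motion analysis: the mean horizontal speed of a foot while it is supposedly on the ground. This is not the authors' metric, and the z-up axis convention, the 5 cm height threshold, and the name `foot_skate` are all assumptions.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]

def foot_skate(positions: List[Vec3], fps: float = 60.0,
               height_thresh: float = 0.05) -> float:
    """Mean horizontal speed (m/s) over consecutive frame pairs in which the
    foot is below `height_thresh` meters in both frames (z is up).
    Returns 0.0 when no contact frames are found."""
    speeds = []
    for (x0, y0, z0), (x1, y1, z1) in zip(positions, positions[1:]):
        if z0 < height_thresh and z1 < height_thresh:
            # Horizontal displacement per frame, converted to meters per second.
            speeds.append(math.hypot(x1 - x0, y1 - y0) * fps)
    return sum(speeds) / len(speeds) if speeds else 0.0

# Toy trajectory: the foot slides 1 cm per frame while grounded, then lifts.
track = [(0.00, 0.0, 0.01), (0.01, 0.0, 0.01), (0.02, 0.0, 0.01), (0.30, 0.0, 0.20)]
skate = foot_skate(track)
print(round(skate, 3))  # 0.6 m/s of sliding during contact
```

A planted foot should score near zero, so a large value on predicted poses indicates the kind of contact error the benchmark reports.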

Topics

CalTennis
3D Pose Estimation
Multi-View Video
Human Motion Capture
Video Datasets
Depth Estimation
Action Analysis

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.