Sign-Language Datasets at Scale: A Comprehensive Survey on Resources, Benchmarks, and Annotation Standards

2024-09-29 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

A comprehensive survey indexes 120 sign-language datasets across 35 sign languages, addressing critical challenges in sign-language recognition, translation, and production. The analysis reveals significant fragmentation, inconsistent annotations, and limited linguistic coverage across existing resources, which constrain advances in automated sign language technologies. Key issues identified include modality imbalance, annotation granularity, and signer bias. To foster standardization and reproducibility, the survey introduces a 24-field "Sign-Language Datasheet" and provides a public GitHub repository with consolidated benchmark results. This work aims to establish a unified foundation for developing inclusive, robust, and scalable sign-language AI applications.

Key takeaway

AI scientists and ML engineers developing sign language technologies should prioritize dataset curation that addresses the current fragmentation and bias. That means incorporating diverse linguistic and demographic coverage, standardizing annotation practices, and ensuring long-term data accessibility. Doing so should improve model generalizability and reduce performance disparities across sign languages and user communities, leading to more inclusive and robust AI systems.

Key insights

Fragmented sign language datasets and inconsistent annotations hinder robust, scalable AI development.

Principles

Dataset accessibility and documentation quality drive impact and sustainability.
Linguistic and geographic diversity are crucial for model generalizability.
Consistent data formats and annotation schemas improve interoperability.

Method

The survey proposes a 24-field "Sign-Language Datasheet" for structured documentation, covering properties like modality, demographics, and vocabulary scale to standardize reporting.
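
As a rough illustration of what structured dataset documentation can look like in code, the sketch below models a handful of datasheet properties as a Python dataclass and lists which ones a dataset has left unreported. The class name, the field names, and the example dataset are all hypothetical; the survey's actual 24 fields are not enumerated in this summary, so this is not the survey's schema.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class DatasheetSketch:
    """Hypothetical subset of datasheet properties (not the survey's 24 fields)."""
    name: str
    sign_language: str
    modality: str                                  # e.g. "RGB video", "pose keypoints"
    num_signers: Optional[int] = None              # demographics
    vocabulary_size: Optional[int] = None          # vocabulary scale
    left_dominant_signers: Optional[int] = None    # handedness reporting


def missing_fields(sheet: DatasheetSketch) -> list:
    """Return the names of fields that were left unreported (None)."""
    return [f.name for f in fields(sheet) if getattr(sheet, f.name) is None]


# "ExampleCorpus" is an invented dataset used only for illustration.
sheet = DatasheetSketch(
    name="ExampleCorpus",
    sign_language="ASL",
    modality="RGB video",
    num_signers=12,
)
print(missing_fields(sheet))
```

Treating unreported properties as explicit gaps, rather than silently omitting them, is one way to make comparisons across datasets easier.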

In practice

Actively include left-dominant signers and report handedness distributions.
Segment long videos at semantically coherent sign boundaries.
Utilize ELAN for hierarchical, multimodal annotation to ensure interoperability.
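
The practices above can be sketched with the Python standard library alone. The first helper computes a handedness distribution from signer metadata; the second reads time-aligned annotations from one tier of an ELAN `.eaf` file (which is XML) so that a long video could be cut at annotation boundaries. The metadata key `dominant_hand`, the tier name `gloss`, and the sample document are assumptions made for illustration, and the parser only handles time-aligned annotations, not reference annotations on dependent tiers.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def handedness_distribution(signers):
    """Return the share of each dominant-hand label among the signers."""
    counts = Counter(s["dominant_hand"] for s in signers)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def read_tier_segments(eaf_xml, tier_id):
    """Return sorted (start_ms, end_ms, value) tuples for one tier."""
    root = ET.fromstring(eaf_xml)
    # Map time-slot ids to millisecond values; unaligned slots are skipped.
    slots = {
        ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
        for ts in root.iter("TIME_SLOT")
        if ts.get("TIME_VALUE") is not None
    }
    segments = []
    for tier in root.iter("TIER"):
        if tier.get("TIER_ID") != tier_id:
            continue
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            if start is None or end is None:
                continue
            value = ann.findtext("ANNOTATION_VALUE", default="")
            segments.append((start, end, value))
    return sorted(segments)


# Minimal invented EAF fragment; real files carry headers and more tiers.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="480"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="910"/>
  </TIME_ORDER>
  <TIER TIER_ID="gloss">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>HELLO</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2" TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>WORLD</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""

signers = [{"dominant_hand": "right"}] * 3 + [{"dominant_hand": "left"}]
print(handedness_distribution(signers))
print(read_tier_segments(SAMPLE_EAF, "gloss"))
```

The resulting millisecond boundaries could then be passed to whatever video tool a project already uses for clipping; that step is left out here because it depends on the toolchain.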

Topics

Sign Language Recognition
Sign Language Translation
Sign Language Production
Dataset Curation
Annotation Standards
Linguistic Diversity

Code references

Ginqwerty/Open-Sign-Language

Best for: Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.