Sign-Language Datasets at Scale: A Comprehensive Survey on Resources, Benchmarks, and Annotation Standards
Summary
A comprehensive survey indexes 120 sign-language datasets across 35 sign languages, addressing critical challenges in sign-language recognition, translation, and production. The analysis reveals significant fragmentation, inconsistent annotations, and limited linguistic coverage across existing resources, which constrain advances in automated sign language technologies. Key issues identified include modality imbalance, annotation granularity, and signer bias. To foster standardization and reproducibility, the survey introduces a 24-field "Sign-Language Datasheet" and provides a public GitHub repository with consolidated benchmark results. This work aims to establish a unified foundation for developing inclusive, robust, and scalable sign-language AI applications.
Key takeaway
AI scientists and ML engineers developing sign language technologies should prioritize dataset curation that addresses current fragmentation and bias. Focus on diverse linguistic and demographic coverage, standardized annotation practices, and long-term data accessibility. This approach should improve model generalizability and reduce performance disparities across sign languages and user communities, supporting more inclusive and robust AI systems.
Key insights
Fragmented sign language datasets and inconsistent annotations hinder robust, scalable AI development.
Principles
- Dataset accessibility and documentation quality drive impact and sustainability.
- Linguistic and geographic diversity are crucial for model generalizability.
- Consistent data formats and annotation schemas improve interoperability.
Method
The survey proposes a 24-field "Sign-Language Datasheet" for structured documentation, covering properties like modality, demographics, and vocabulary scale to standardize reporting.
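As an illustration of what structured documentation could look like in code, here is a minimal sketch of a datasheet record. It covers only the three property groups named above (modality, demographics, vocabulary scale); the class name, field names, and the `missing_fields` helper are hypothetical and are not the survey's actual 24-field schema.

```python
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class SignLanguageDatasheet:
    """Hypothetical subset of a datasheet; field names are illustrative only."""

    name: str
    sign_language: str
    modalities: List[str] = field(default_factory=list)      # e.g. "RGB video", "pose"
    vocabulary_size: Optional[int] = None                     # number of distinct glosses
    num_signers: Optional[int] = None
    signer_demographics: Dict[str, str] = field(default_factory=dict)

    def missing_fields(self) -> List[str]:
        """Return the names of fields left empty, so documentation gaps are visible."""
        return [k for k, v in asdict(self).items() if v in (None, "", [], {})]


# Example record with placeholder values (not a real dataset).
sheet = SignLanguageDatasheet(
    name="ExampleCorpus",
    sign_language="ASL",
    modalities=["RGB video"],
)
print(sheet.missing_fields())
```

A check like `missing_fields` is one way to flag incomplete documentation before a dataset is released or indexed.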
In practice
- Actively include left-dominant signers and report handedness distributions.
- Segment long videos at semantically coherent sign boundaries.
- Use ELAN for hierarchical, multimodal annotation to keep annotations interoperable.
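The first recommendation above, reporting handedness distributions, can be sketched in a few lines. This assumes signer metadata is available as a list of dictionaries with an optional `handedness` key; that layout and the function name are assumptions for illustration, not a format defined by the survey.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def handedness_distribution(signers: Iterable[Mapping[str, str]]) -> Dict[str, float]:
    """Return the share of signers per handedness label.

    Signers without a recorded label are counted as "unknown" rather than
    dropped, so missing metadata stays visible in the report.
    """
    counts = Counter(s.get("handedness", "unknown") for s in signers)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


# Placeholder metadata (hypothetical signers, not drawn from any dataset).
example_signers = [
    {"id": "s1", "handedness": "right"},
    {"id": "s2", "handedness": "left"},
    {"id": "s3", "handedness": "right"},
    {"id": "s4"},  # handedness not recorded
]
print(handedness_distribution(example_signers))
```

Keeping an explicit "unknown" bucket avoids overstating coverage of left-dominant signers when metadata is incomplete.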
Topics
- Sign Language Recognition
- Sign Language Translation
- Sign Language Production
- Dataset Curation
- Annotation Standards
- Linguistic Diversity
Best for: Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.