Libras-UFPel Corpus: A Parallel Dataset of Brazilian Sign Language and Portuguese for Multimodal Research and Processing
Summary
The Libras-UFPel Corpus is a new multimodal, multilayer parallel dataset for Brazilian Sign Language (Libras) and written Portuguese, designed for computational analysis and language documentation. It comprises 4,800 controlled audiovisual records (2,400 sentences and 2,400 isolated signs), each paired with a Portuguese translation. The corpus also includes about 10 hours of spontaneous interaction from three naturalistic interviews, which are still being edited. To date, 1,200 of the controlled sentences have been lemmatized, gloss-annotated, and translated, forming a structured parallel subset. The resource supports Libras-to-Portuguese Sign Language Processing tasks such as recognition and machine translation. Its annotation follows a hierarchical model that covers lexical, partially lexical, and non-lexical signs, with independent tiers for non-manual markers.
Key takeaway
For NLP engineers and AI scientists working on accessibility, the Libras-UFPel Corpus is a structured, multimodal resource for building Brazilian Sign Language models. Work on sign language recognition and machine translation can draw on it directly, which in turn supports digital inclusion for the deaf community.
Key insights
The Libras-UFPel Corpus provides a parallel multimodal dataset for Brazilian Sign Language and Portuguese.
Principles
- Integrate controlled and naturalistic data.
- Ensure interoperability via shared standards.
- Employ hierarchical annotation for sign language.
Method
The corpus development involves controlled audiovisual recordings, naturalistic interviews, and systematic annotation including lemmatization, glossing, and translation, with a hierarchical model for sign types and non-manual markers.
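One way to picture the annotation layers described above is as a small record type. The following is a minimal sketch only: the class names, field names, and the millisecond time spans are assumptions made for illustration, not the corpus's actual schema or file format, and the sample glosses are invented placeholders rather than corpus data.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class SignType(Enum):
    """The three sign categories named in the summary."""
    LEXICAL = "lexical"
    PARTIALLY_LEXICAL = "partially_lexical"
    NON_LEXICAL = "non_lexical"


@dataclass
class SignAnnotation:
    gloss: str           # gloss label for the sign (hypothetical field)
    lemma: str           # lemmatized form (hypothetical field)
    sign_type: SignType


@dataclass
class NonManualMarker:
    # Kept in its own tier, independent of the manual signs.
    label: str           # placeholder label, e.g. "brow_raise"
    start_ms: int
    end_ms: int


@dataclass
class SentenceRecord:
    video_id: str        # hypothetical identifier for the recording
    portuguese: str      # written Portuguese translation
    signs: List[SignAnnotation] = field(default_factory=list)
    non_manual: List[NonManualMarker] = field(default_factory=list)

    def gloss_sequence(self) -> str:
        """Join the glosses in signing order."""
        return " ".join(s.gloss for s in self.signs)

    def is_fully_annotated(self) -> bool:
        """True when both a gloss tier and a translation are present."""
        return bool(self.signs) and bool(self.portuguese.strip())


# Invented placeholder values, not taken from the corpus.
record = SentenceRecord(
    video_id="sent_0001",
    portuguese="Eu vou para casa.",
    signs=[
        SignAnnotation("EU", "EU", SignType.LEXICAL),
        SignAnnotation("IR", "IR", SignType.LEXICAL),
        SignAnnotation("CASA", "CASA", SignType.LEXICAL),
    ],
    non_manual=[NonManualMarker("brow_raise", 0, 400)],
)
```

Keeping non-manual markers in a separate list with their own time spans mirrors the idea of independent tiers: they can overlap any number of manual signs without being attached to one.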
In practice
- Develop Sign Language Recognition systems.
- Build Sign Language Machine Translation models.
- Conduct descriptive linguistic analysis of Libras.
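For the machine translation use case, a common first step is to keep only the records that already have both a gloss annotation and a translation. The sketch below assumes a hypothetical list-of-dicts export with `glosses` and `portuguese` keys; the corpus's real distribution format is not described here, and using glosses as the source side is an illustrative simplification, not the authors' stated pipeline.

```python
from typing import Dict, List, Tuple


def build_parallel_pairs(records: List[Dict]) -> List[Tuple[str, str]]:
    """Return (gloss sequence, Portuguese) pairs for fully annotated records.

    Records missing either side are skipped, which is how one would
    isolate an annotated parallel subset from a larger collection.
    """
    pairs = []
    for rec in records:
        glosses = rec.get("glosses") or []
        translation = (rec.get("portuguese") or "").strip()
        if glosses and translation:
            pairs.append((" ".join(glosses), translation))
    return pairs


# Invented placeholder records, not corpus data.
sample = [
    {"glosses": ["EU", "IR", "CASA"], "portuguese": "Eu vou para casa."},
    {"glosses": [], "portuguese": "Ainda sem glosas."},   # not yet glossed
    {"glosses": ["LIVRO"], "portuguese": ""},             # not yet translated
]

pairs = build_parallel_pairs(sample)
```

Here only the first record survives the filter, since the other two are each missing one side of the pair.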
Topics
- Libras-UFPel Corpus
- Brazilian Sign Language
- Multimodal Datasets
- Sign Language Processing
- Natural Language Processing
Best for: AI Scientist, NLP Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.