Building Multimodal Corpora Using Microtask Pipelines and Local Annotators
Summary
Prodigy, an existing commercial annotation tool, was repurposed to support building multimodal corpora. It was then extended with components that combine annotation tasks into pipelines, cross-validate annotations, and manage annotator access to those tasks. The aim is a system that handles complex annotation workflows reliably and efficiently.
Key takeaway
MLOps engineers and data scientists building custom multimodal annotation systems should consider adapting a proven commercial tool such as Prodigy rather than building core functionality from scratch. That frees development effort for the enhancements that matter most: pipeline orchestration, cross-validation of annotations, and streamlined annotator access. This can shorten development time and improve data quality on complex annotation projects.
Key insights
Repurposing and extending an existing commercial tool can be an efficient way to build robust multimodal annotation infrastructure.
Principles
- Use existing commercial tools where they fit.
- Extend them for project-specific needs.
- Integrate task pipelines, cross-validation, and annotator access management.
Method
Repurpose an existing commercial annotation tool (Prodigy), then enhance it with components for task pipelining, cross-validation, and managing annotator access.
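The method above can be illustrated with a minimal sketch of a task pipeline that also gates annotator access per stage. This is not the article's implementation and does not use Prodigy's API: `Stage`, `Pipeline`, the stage names, and the `history` field are all hypothetical, and the "accept"/"reject" answers are only modeled on the style of answer Prodigy records. The assumption is that a task moves to the next stage once it is accepted, and that each stage has its own list of permitted annotators.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    """One annotation step (hypothetical structure, not a Prodigy class)."""
    name: str
    annotators: set[str]                    # who may work on this stage
    queue: list[dict] = field(default_factory=list)


class Pipeline:
    """Chains stages so accepted output of one stage feeds the next."""

    def __init__(self, stages):
        self.stages = list(stages)
        self._index = {s.name: i for i, s in enumerate(self.stages)}

    def _stage_for(self, stage_name, annotator):
        # Access check: only annotators assigned to a stage may touch it.
        stage = self.stages[self._index[stage_name]]
        if annotator not in stage.annotators:
            raise PermissionError(f"{annotator} has no access to {stage_name}")
        return stage

    def add(self, task):
        # New tasks always enter at the first stage.
        self.stages[0].queue.append(task)

    def next_task(self, stage_name, annotator):
        stage = self._stage_for(stage_name, annotator)
        return stage.queue.pop(0) if stage.queue else None

    def submit(self, stage_name, annotator, task, answer):
        self._stage_for(stage_name, annotator)
        record = {"stage": stage_name, "annotator": annotator, "answer": answer}
        done = {**task, "history": task.get("history", []) + [record]}
        nxt = self._index[stage_name] + 1
        # Assumption: only accepted tasks move on; rejected ones stop here.
        if answer == "accept" and nxt < len(self.stages):
            self.stages[nxt].queue.append(done)
        return done


# Usage: "alice" segments a page image, "bob" verifies her work.
pipeline = Pipeline([Stage("segment", {"alice"}), Stage("verify", {"bob"})])
pipeline.add({"image": "page_001.png"})
task = pipeline.next_task("segment", "alice")
pipeline.submit("segment", "alice", task, "accept")
print(pipeline.next_task("verify", "bob"))
```

Keeping the access check inside the pipeline, rather than in each annotation interface, means a new stage only needs a name and a set of annotators to be added to the workflow.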
In practice
- Adapt Prodigy for custom annotation.
- Build annotation task pipelines.
- Implement cross-validation.
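One way to read "cross-validation" here is comparing the answers several annotators give for the same item. The sketch below takes that reading as an assumption; the article may use a different scheme. The function name `cross_validate` and the `min_agreement` parameter are hypothetical, and the `_input_hash`, `_annotator_id`, and `answer` keys are modeled on the fields Prodigy stores with annotated examples.

```python
from collections import Counter, defaultdict


def cross_validate(annotations, min_agreement=1.0):
    """Split items into agreed labels and items needing review.

    An item counts as agreed only if at least two annotators answered it
    and the share of votes for the most common answer reaches
    min_agreement. Everything else is returned as disputed.
    """
    by_item = defaultdict(list)
    for ann in annotations:
        by_item[ann["_input_hash"]].append(ann["answer"])

    agreed, disputed = {}, []
    for item, answers in by_item.items():
        label, votes = Counter(answers).most_common(1)[0]
        if len(answers) > 1 and votes / len(answers) >= min_agreement:
            agreed[item] = label
        else:
            disputed.append(item)
    return agreed, disputed


# Usage: item 1 is agreed, item 2 is contested, item 3 has one answer only.
annotations = [
    {"_input_hash": 1, "_annotator_id": "alice", "answer": "accept"},
    {"_input_hash": 1, "_annotator_id": "bob", "answer": "accept"},
    {"_input_hash": 2, "_annotator_id": "alice", "answer": "accept"},
    {"_input_hash": 2, "_annotator_id": "bob", "answer": "reject"},
    {"_input_hash": 3, "_annotator_id": "alice", "answer": "accept"},
]
agreed, disputed = cross_validate(annotations)
print(agreed, disputed)
```

Disputed items can be sent back into an annotation queue for a further annotator or an adjudicator, which is where this check would connect to a task pipeline.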
Topics
- Multimodal Corpora
- Annotation Tools
- Prodigy
- Data Pipelines
- Cross-validation
- Annotation Infrastructure
Best for: Data Engineer, MLOps Engineer, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion (Explosion.ai).