Task Routers in Prodigy
Summary
Prodigy 1.12 introduces custom task routers, which let machine learning engineers define in Python how annotation tasks are distributed among annotators. Routers address the problem of mapping tasks to a pool of annotators and cover cases ranging from one annotator per task, through full or partial overlap, to conditional routing on task properties such as language or model confidence. A task router is a Python function attached to a Prodigy recipe: it receives the controller, the current session ID, and the current example, and returns the list of session IDs that should receive that example. To keep routing consistent across server restarts, pre-define the annotators with the "PRODIGY_ALLOWED_SESSIONS" environment variable and select annotators with a deterministic hashing trick, which also spreads tasks roughly evenly. For simpler overlap requirements, Prodigy ships built-in task routers that are configured in "prodigy.json".
Key takeaway
For machine learning engineers or data annotation leads designing complex annotation workflows, custom task routers in Prodigy 1.12 replace fixed distribution rules with Python logic you write yourself. That logic can control annotator overlap, route tasks by data attributes such as language, or act on model confidence scores. To keep task distribution consistent and avoid imbalances between annotators, define the annotator pool explicitly with the "PRODIGY_ALLOWED_SESSIONS" environment variable and consider deterministic hashing for task assignment.
Key insights
Prodigy's custom task routers enable highly flexible, code-driven control over data annotation task distribution and annotator overlap.
Principles
- Deterministic routing ensures consistency across server restarts.
- Pre-defining sessions improves routing consistency.
- Task routers can integrate database state or model confidence.
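The confidence-based case can be sketched as follows. This is a minimal illustration, not Prodigy's own implementation: the controller is replaced by a stand-in object that only carries "session_ids", and the "confidence" key under "meta", the 0.3 threshold, and the session names are all assumptions made for the example.

```python
from types import SimpleNamespace
from typing import Dict, List


def route_by_confidence(ctrl, session_id: str, item: Dict) -> List[str]:
    """Send low-confidence tasks to every annotator, the rest to one."""
    # Sort the pool so its order is the same on every run.
    pool = sorted(ctrl.session_ids)
    # "confidence" is an assumed key; a missing score counts as 0.0,
    # so unscored tasks are reviewed by everyone.
    confidence = item.get("meta", {}).get("confidence", 0.0)
    if confidence < 0.3:  # illustrative threshold
        return pool
    # Otherwise pick one annotator deterministically from the task hash.
    return [pool[item["_task_hash"] % len(pool)]]


# Stand-in for the controller (hypothetical; exposes only session_ids).
ctrl = SimpleNamespace(session_ids=["ner-alex", "ner-bo", "ner-cy"])
uncertain = {"text": "a", "_task_hash": 7, "meta": {"confidence": 0.1}}
confident = {"text": "b", "_task_hash": 7, "meta": {"confidence": 0.9}}

print(route_by_confidence(ctrl, "ner-alex", uncertain))  # all three sessions
print(route_by_confidence(ctrl, "ner-alex", confident))  # a single session
```

The router ignores its own "session_id" argument here because the decision depends only on the task, which is what keeps it repeatable.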
Method
Define a Python function that takes the controller, the session ID, and the example, and returns the target session IDs. To assign tasks consistently and evenly, index into the annotator pool with the task hash modulo the pool size.
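The hashing trick can be sketched in a few lines. The example below assumes each task carries an integer "_task_hash" and uses a stand-in controller object with a "session_ids" list; the session names and the consecutive hash values are made up for the demonstration.

```python
from collections import Counter
from types import SimpleNamespace
from typing import Dict, List


def task_router(ctrl, session_id: str, item: Dict) -> List[str]:
    """Assign each task to exactly one annotator, chosen by its hash."""
    # A sorted pool gives the same ordering after every restart.
    pool = sorted(ctrl.session_ids)
    # Python's % returns a non-negative result for a positive modulus,
    # so negative hashes still map to a valid index.
    idx = item["_task_hash"] % len(pool)
    return [pool[idx]]


# Stand-in for the controller (hypothetical; exposes only session_ids).
ctrl = SimpleNamespace(session_ids=["ner-cy", "ner-alex", "ner-bo"])

# Twelve made-up tasks with consecutive hashes, including negative ones.
tasks = [{"_task_hash": h} for h in range(-6, 6)]
counts = Counter(task_router(ctrl, "ner-alex", t)[0] for t in tasks)
print(counts)
```

With consecutive hashes the split is exactly even. Real task hashes are not consecutive, so the distribution is only approximately even, but the same task always reaches the same annotator as long as the pool does not change.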
In practice
- Set "PRODIGY_ALLOWED_SESSIONS" so that multi-annotator routing stays consistent.
- Call the controller's database methods inside a task router when the routing decision depends on annotation state.
- Set "feed_overlap" or "annotations_per_task" in "prodigy.json" when basic overlap rules are enough.
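For the built-in routers, the two settings named above go into "prodigy.json". The sketch below uses Python only to print two alternative settings fragments as JSON. The value 2 is an illustrative assumption, and the fragments are shown separately because this summary does not say whether the two settings may be combined.

```python
import json

# Alternative 1: full overlap, every annotator sees every task.
full_overlap = {"feed_overlap": True}

# Alternative 2: each task is annotated a fixed number of times.
# The value 2 is an illustrative assumption.
fixed_count = {"annotations_per_task": 2}

# Print each fragment in the form it would take inside prodigy.json.
for settings in (full_overlap, fixed_count):
    print(json.dumps(settings, indent=2))
```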
Topics
- Data Annotation
- Prodigy
- Task Routing
- Machine Learning Engineering
- Python Development
- Hashing Algorithms
Best for: Machine Learning Engineer, AI Engineer, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion (Explosion.ai).