A multi-modal dataset for insect biodiversity with imagery and DNA at the trap and individual level
Summary
The MassID45 dataset, published in Scientific Data in April 2026, provides a multi-modal resource for insect biodiversity research, combining molecular and imaging data at both unsorted bulk sample and individual specimen levels. It comprises 45 bulk arthropod samples, primarily insects, collected from Malaise traps in Sweden and Finland during 2021. Each sample includes DNA metabarcoding data, one or more unsorted bulk images, and sample-level biomass measurements. Additionally, the dataset offers individual-level images and DNA barcode sequences for 35,510 sorted specimens. Human annotators, assisted by an AI tool, created segmentation masks for over 17,000 individual arthropods within bulk images and assigned taxonomic labels, guided by DNA-based sample-specific taxonomies. This dataset is designed to train automatic classifiers for bulk insect samples, addressing challenges in tiny object detection and instance segmentation for ecological and machine learning applications.
Key takeaway
For Computer Vision Engineers developing automated biodiversity monitoring systems, MassID45 offers a critical benchmark for instance segmentation of tiny, densely packed arthropods. Fine-tune models on this dataset for bulk insect sample analysis rather than relying on zero-shot approaches, which perform significantly worse than supervised methods. Optimize for order- or family-level classification, where annotation confidence is highest; this yields ecologically relevant results even though species-level resolution remains difficult.
Key insights
MassID45 integrates bulk and individual insect imagery with DNA data to train high-throughput taxonomic classifiers.
Principles
- Combining DNA and image data enhances classifier performance.
- Higher taxonomic ranks (order/family) are more reliably annotated in bulk images.
- Tiling large images, rather than downscaling them, preserves the pixel density that small object detection depends on.
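The point about higher taxonomic ranks can be illustrated with a small helper that falls back to the finest label available at or above a target rank. This is a minimal sketch: the `RANKS` tuple, the `lineage` dictionary layout, and the function name are hypothetical and do not reflect the MassID45 annotation schema.

```python
from typing import Dict, Optional, Tuple

# Ranks ordered from coarse to fine (hypothetical layout, not the dataset's schema).
RANKS = ("order", "family", "genus", "species")


def label_at_or_above(
    lineage: Dict[str, str], target: str
) -> Optional[Tuple[str, str]]:
    """Return (rank, name) for the finest label that is no finer than `target`.

    Returns None when the annotation has no label at `target` or any coarser rank.
    """
    idx = RANKS.index(target)
    # Walk from the target rank back toward the coarsest rank.
    for rank in reversed(RANKS[: idx + 1]):
        name = lineage.get(rank)
        if name:
            return rank, name
    return None


# Illustrative annotation that was only resolved to family level.
annotation = {"order": "Diptera", "family": "Chironomidae"}
family_label = label_at_or_above(annotation, "family")
```

Collapsing labels this way lets a classifier be trained and scored at the rank where annotations are most reliable, without discarding specimens that lack finer labels.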
Method
The annotation workflow has three stages: initial watershed segmentation, AI-assisted mask refinement using TORAS, and taxonomic labeling guided by sample-specific taxonomies derived from DNA barcoding. Images are tiled for deep learning training and inference, and SAHI (Slicing Aided Hyper Inference) merges the per-tile predictions.
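The tiling step can be sketched as a function that computes overlapping tile windows over a large image. This is a minimal illustration under assumed parameters; the tile size and overlap below are placeholders, not the values used for MassID45, and the function names are hypothetical.

```python
from typing import List, Tuple


def tile_origins(length: int, tile: int, overlap: int) -> List[int]:
    """Start offsets of tiles of size `tile` that cover [0, length)."""
    if overlap >= tile:
        raise ValueError("overlap must be smaller than the tile size")
    if length <= tile:
        return [0]  # a single tile already covers the whole axis
    stride = tile - overlap
    origins = list(range(0, length - tile, stride))
    # Place the final tile flush with the far edge so no pixels are missed.
    origins.append(length - tile)
    return origins


def tile_boxes(
    width: int, height: int, tile: int, overlap: int
) -> List[Tuple[int, int, int, int]]:
    """Return (x0, y0, x1, y1) windows covering a width x height image."""
    boxes = []
    for y0 in tile_origins(height, tile, overlap):
        for x0 in tile_origins(width, tile, overlap):
            # Clamp to the image bounds for images smaller than one tile.
            boxes.append((x0, y0, min(x0 + tile, width), min(y0 + tile, height)))
    return boxes


# Placeholder sizes for illustration only.
windows = tile_boxes(width=1000, height=600, tile=512, overlap=128)
```

Each window can then be cropped at full resolution and passed to the model, so small specimens keep their original pixel footprint instead of being shrunk by whole-image resizing.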
In practice
- Use MassID45 for instance segmentation of tiny, densely packed objects.
- Apply transfer learning by initializing models from MS-COCO pretrained weights and fine-tuning them on the specialized task.
- Implement SAHI for merging predictions across overlapping image tiles.
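The idea behind merging predictions across overlapping tiles can be sketched in plain Python: shift each tile-local box into full-image coordinates, then suppress duplicates that overlap a higher-scoring detection. This sketch does not use SAHI's API and simplifies to bounding boxes with greedy non-maximum suppression; the function names, the input layout, and the 0.5 IoU threshold are assumptions for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def to_global(box: Box, origin: Tuple[float, float]) -> Box:
    """Shift a tile-local box by the tile's top-left corner."""
    ox, oy = origin
    return (box[0] + ox, box[1] + oy, box[2] + ox, box[3] + oy)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def merge_tile_predictions(
    tile_preds: List[Tuple[Tuple[float, float], List[Tuple[Box, float]]]],
    iou_threshold: float = 0.5,
) -> List[Tuple[Box, float]]:
    """Greedy NMS over detections pooled from all tiles.

    `tile_preds` holds (tile_origin, [(local_box, score), ...]) pairs.
    """
    pooled = [
        (to_global(box, origin), score)
        for origin, preds in tile_preds
        for box, score in preds
    ]
    pooled.sort(key=lambda p: p[1], reverse=True)  # highest score first
    kept: List[Tuple[Box, float]] = []
    for box, score in pooled:
        # Keep a detection only if it does not duplicate one already kept.
        if all(iou(box, k[0]) < iou_threshold for k in kept):
            kept.append((box, score))
    return kept
```

For instance segmentation the same logic would apply to masks rather than boxes; in practice a library such as SAHI handles the slicing and merging, so a hand-rolled version like this is mainly useful for understanding or debugging the behavior at tile seams.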
Topics
- MassID45 Dataset
- Insect Biodiversity Monitoring
- Multi-modal Data Integration
- Instance Segmentation
- DNA Barcoding
Best for: Computer Vision Engineer, AI Scientist, Research Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine learning : nature.com subject feeds.