A multi-modal dataset for insect biodiversity with imagery and DNA at the trap and individual level

· Source: Machine learning : nature.com subject feeds · Field: Science & Research — Life Sciences & Biology, Environmental Science & Earth Systems, Research Methodology & Innovation · Depth: Expert, extended

Summary

The MassID45 dataset, published in Scientific Data in April 2026, is a multi-modal resource for insect biodiversity research that combines molecular and imaging data at both the unsorted bulk-sample level and the individual-specimen level. It comprises 45 bulk arthropod samples, primarily insects, collected from Malaise traps in Sweden and Finland during 2021. Each sample includes DNA metabarcoding data, one or more unsorted bulk images, and sample-level biomass measurements. The dataset also provides individual-level images and DNA barcode sequences for 35,510 sorted specimens. Human annotators, assisted by an AI tool, created segmentation masks for over 17,000 individual arthropods within the bulk images and assigned taxonomic labels guided by DNA-based, sample-specific taxonomies. The dataset is designed for training automatic classifiers for bulk insect samples, a task that combines tiny object detection with instance segmentation and is relevant to both ecological and machine learning applications.

Key takeaway

For computer vision engineers developing automated biodiversity monitoring systems, MassID45 offers a benchmark for instance segmentation of tiny, densely packed arthropods. Fine-tune models on this dataset for bulk insect sample analysis rather than relying on zero-shot approaches, which perform significantly worse than supervised methods. Prioritize order-level or family-level classification, where annotation confidence is highest, to obtain ecologically relevant results even though species-level resolution remains difficult.
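
One way to act on the advice about coarser ranks is to score the same predictions at several taxonomic levels. The sketch below is illustrative only: the function names, the four-rank lineage tuple, and the toy labels are assumptions made for this example and do not reflect the dataset's actual label format or the authors' evaluation code.

```python
# Hypothetical sketch: score predictions at several taxonomic ranks.
# Lineages are (order, family, genus, species) tuples; this layout is an
# assumption for illustration, not the MassID45 label schema.
RANKS = ("order", "family", "genus", "species")


def truncate(lineage, rank):
    """Keep the lineage down to and including the requested rank."""
    depth = RANKS.index(rank) + 1
    return tuple(lineage[:depth])


def accuracy_at_rank(truths, preds, rank):
    """Fraction of predictions whose lineage matches the truth up to `rank`."""
    if len(truths) != len(preds):
        raise ValueError("truths and preds must have the same length")
    if not truths:
        raise ValueError("at least one labeled specimen is required")
    hits = sum(truncate(t, rank) == truncate(p, rank)
               for t, p in zip(truths, preds))
    return hits / len(truths)


# Toy labels, invented for this example.
TRUTH = [
    ("Diptera", "Chironomidae", "Chironomus", "Chironomus plumosus"),
    ("Diptera", "Culicidae", "Aedes", "Aedes vexans"),
    ("Hymenoptera", "Formicidae", "Formica", "Formica rufa"),
    ("Coleoptera", "Coccinellidae", "Coccinella", "Coccinella septempunctata"),
]
PRED = [
    ("Diptera", "Chironomidae", "Chironomus", "Chironomus plumosus"),
    ("Diptera", "Culicidae", "Culex", "Culex pipiens"),         # family correct
    ("Hymenoptera", "Apidae", "Bombus", "Bombus terrestris"),   # order correct
    ("Coleoptera", "Coccinellidae", "Coccinella", "Coccinella septempunctata"),
]

if __name__ == "__main__":
    for rank in RANKS:
        print(rank, accuracy_at_rank(TRUTH, PRED, rank))
```

Because a prediction that misses the species can still be right at family or order level, accuracy can only stay equal or rise as the rank gets coarser, which is why reporting per-rank scores gives a fairer picture than a single species-level number.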

Key insights

MassID45 integrates bulk and individual insect imagery with DNA data to train high-throughput taxonomic classifiers.

Method

The annotation workflow has three stages: initial watershed segmentation, AI-assisted mask refinement using TORAS, and taxonomic labeling guided by sample-specific taxonomies derived from DNA barcoding. For deep learning training and inference, images are split into tiles, and SAHI (Slicing Aided Hyper Inference) merges the tile-level predictions.
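
The tile-and-merge idea can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not SAHI's implementation and not the authors' pipeline: it uses bounding boxes rather than masks, and the tile size, overlap, IoU threshold, and all function names are hypothetical. It cuts overlapping tiles, shifts tile-local detections back to full-image coordinates, and drops duplicates from overlapping tiles with greedy non-maximum suppression.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned detection with a confidence score."""
    x1: float
    y1: float
    x2: float
    y2: float
    score: float


def tile_origins(length, tile, overlap):
    """Tile start offsets along one axis, covering the full length."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    if length <= tile:
        return [0]
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # last tile sits flush with the far edge
    return starts


def tiles(width, height, tile, overlap):
    """Top-left (x, y) origins of every tile, row by row."""
    return [(x, y)
            for y in tile_origins(height, tile, overlap)
            for x in tile_origins(width, tile, overlap)]


def to_full_image(box, origin):
    """Shift a tile-local box into full-image coordinates."""
    ox, oy = origin
    return Box(box.x1 + ox, box.y1 + oy, box.x2 + ox, box.y2 + oy, box.score)


def iou(a, b):
    """Intersection over union of two boxes."""
    iw = min(a.x2, b.x2) - max(a.x1, b.x1)
    ih = min(a.y2, b.y2) - max(a.y1, b.y1)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a.x2 - a.x1) * (a.y2 - a.y1)
    area_b = (b.x2 - b.x1) * (b.y2 - b.y1)
    return inter / (area_a + area_b - inter)


def merge(boxes, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box of each overlapping group."""
    kept = []
    for box in sorted(boxes, key=lambda b: b.score, reverse=True):
        if all(iou(box, k) < iou_threshold for k in kept):
            kept.append(box)
    return kept


if __name__ == "__main__":
    # The same specimen seen in two overlapping tiles of a 1000 x 700 image.
    seen_in_first = to_full_image(Box(350, 10, 390, 50, 0.9), (0, 0))
    seen_in_second = to_full_image(Box(50, 10, 90, 50, 0.8), (300, 0))
    print(tiles(1000, 700, 400, 100))
    print(merge([seen_in_first, seen_in_second]))
```

The overlap matters because a small arthropod lying on a tile border would otherwise be cut in two. With overlap, it appears whole in at least one tile, and the merge step removes the duplicate detection.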

Best for: Computer Vision Engineer, AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine learning : nature.com subject feeds.