Sample-Level Versioning for ML Pipelines with Dagster and Metaxy

· Source: Dagster Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Data Science & Analytics · Depth: Intermediate, medium

Summary

Anam, a company developing real-time interactive avatars, has open-sourced Metaxy, a framework for sample-level versioning in multimodal data pipelines orchestrated by Dagster. Traditional data versioning systems, including Dagster's native capabilities, typically operate at the asset or table level, so downstream steps are re-computed in full even when an upstream change affects only a subset of data fields. Metaxy instead tracks partial data updates at field-level granularity: each data version is a dictionary with entries for specific data fields (e.g., audio vs. video frames). This lets Metaxy detect fields that an upstream change did not touch and skip their re-computation, reducing compute and I/O costs in expensive multimodal pipelines involving video, audio, and embeddings. Metaxy is infrastructure-agnostic, relying on projects like Ibis and Narwhals for database and DataFrame engine compatibility, and it integrates with Dagster's asset-oriented declarative DSL.
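
To make the dictionary-based version idea concrete, here is a minimal sketch in plain Python. It does not use Metaxy's actual API; the names `FIELD_DEPS`, `field_version`, and `downstream_versions`, the field names, and the hashing scheme are all assumptions chosen for illustration. Each downstream field's version is derived only from the upstream fields it reads, so a change to the audio leaves the frame-derived field untouched.

```python
import hashlib

# Hypothetical dependency map (not Metaxy's API):
# downstream field -> upstream fields it reads.
FIELD_DEPS = {
    "transcript": ["audio"],
    "face_crops": ["frames"],
}


def field_version(code_version: str, upstream_versions: dict[str, str]) -> str:
    """Hash the code version together with the versions of the upstream fields read."""
    digest = hashlib.sha256(code_version.encode())
    for name in sorted(upstream_versions):  # sorted for a stable hash
        digest.update(f"|{name}={upstream_versions[name]}".encode())
    return digest.hexdigest()[:12]


def downstream_versions(sample: dict[str, str], code_version: str = "v1") -> dict[str, str]:
    """Build the per-field version dictionary for one sample."""
    return {
        field: field_version(code_version, {dep: sample[dep] for dep in deps})
        for field, deps in FIELD_DEPS.items()
    }


# The same sample before and after only its audio field changed upstream.
before = downstream_versions({"audio": "a1", "frames": "f1"})
after = downstream_versions({"audio": "a2", "frames": "f1"})

changed = sorted(f for f in before if before[f] != after[f])
print(changed)  # ['transcript']
```

Only `transcript` needs re-computation here; an asset-level version would have marked both downstream fields as out of date.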

Key takeaway

For MLOps engineers managing multimodal data pipelines, Metaxy offers a way to control costs. If your current Dagster pipelines re-compute entire assets after minor upstream changes that affect only specific data fields, integrating Metaxy lets you implement granular, sample-level versioning. Processing steps whose input fields are unchanged can then be skipped, which avoids re-running expensive ML models on large datasets and reduces cloud spend.

Key insights

Metaxy enables granular, sample-level versioning for multimodal data pipelines, optimizing compute by tracking field-specific dependencies.

Method

Metaxy extends Dagster's versioning by tracking individual sample fields, computing dictionary-based versions, and resolving updates to identify new, stale, or orphaned samples for selective processing.
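
The update-resolution step can be sketched as a comparison between the versions expected from upstream and the versions already recorded for a step's output. This is an illustrative sketch under stated assumptions, not Metaxy's implementation: `UpdatePlan` and `resolve_update` are hypothetical names, and the definitions used here (new = no stored version, stale = stored version differs, orphaned = stored but no longer expected) are one plausible reading of the terms in the summary.

```python
from dataclasses import dataclass, field


@dataclass
class UpdatePlan:
    """Hypothetical result type: sample ids grouped by the action they need."""
    new: list[str] = field(default_factory=list)
    stale: list[str] = field(default_factory=list)
    orphaned: list[str] = field(default_factory=list)


def resolve_update(
    expected: dict[str, dict[str, str]],
    stored: dict[str, dict[str, str]],
) -> UpdatePlan:
    """Compare expected per-sample version dicts against stored ones."""
    plan = UpdatePlan()
    for sample_id in sorted(expected):
        if sample_id not in stored:
            plan.new.append(sample_id)        # never processed
        elif stored[sample_id] != expected[sample_id]:
            plan.stale.append(sample_id)      # processed with outdated inputs
    # Stored results whose upstream sample no longer exists.
    plan.orphaned = sorted(set(stored) - set(expected))
    return plan


expected = {
    "clip_1": {"audio": "a1", "frames": "f1"},
    "clip_2": {"audio": "a2", "frames": "f1"},
    "clip_3": {"audio": "a1", "frames": "f1"},
}
stored = {
    "clip_1": {"audio": "a1", "frames": "f1"},  # up to date, skipped
    "clip_2": {"audio": "a1", "frames": "f1"},  # audio changed upstream
    "clip_4": {"audio": "a1", "frames": "f1"},  # removed upstream
}

plan = resolve_update(expected, stored)
print(plan.new, plan.stale, plan.orphaned)
```

Only the samples in `plan.new` and `plan.stale` would be passed to the expensive processing step; `clip_1` is skipped because its stored version dictionary matches the expected one.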

Best for: Computer Vision Engineer, MLOps Engineer, Machine Learning Engineer, Data Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Dagster Blog.