Sample-Level Versioning for ML Pipelines with Dagster and Metaxy
Summary
Anam, a company developing real-time interactive avatars, has open-sourced Metaxy, a new framework designed to solve sample-level versioning for multimodal data pipelines orchestrated by Dagster. Traditional data versioning systems, including Dagster's native capabilities, typically operate at an asset or table level, leading to unnecessary re-computation of downstream steps when only a subset of data fields are affected by an upstream change. Metaxy addresses this by tracking partial data updates with field-level granularity, where each data version is a dictionary aware of specific data fields (e.g., audio vs. video frames). This allows Metaxy to detect and skip re-computation for unaffected fields, significantly reducing compute and I/O costs in expensive multimodal pipelines involving video, audio, and embeddings. Metaxy is infrastructure-agnostic, leveraging projects like Ibis and Narwhals for database and DataFrame engine compatibility, and integrates seamlessly with Dagster's asset-oriented declarative DSL.
Key takeaway
For MLOps Engineers managing multimodal data pipelines, Metaxy offers a critical solution to control costs and improve efficiency. If your current Dagster pipelines re-compute entire assets due to minor upstream changes affecting only specific data fields, integrating Metaxy will allow you to implement granular, sample-level versioning. This enables intelligent skipping of unaffected processing steps, preventing wasteful re-execution of expensive ML models on large datasets and optimizing your cloud spend.
Key insights
Metaxy enables granular, sample-level versioning for multimodal data pipelines, optimizing compute by tracking field-specific dependencies.
Principles
- Multimodal pipelines require incremental processing.
- Field-level versioning prevents unnecessary re-computation.
- Composable data tooling enhances adaptability.
Method
Metaxy extends Dagster's versioning by tracking individual sample fields, computing dictionary-based versions, and resolving updates to identify new, stale, or orphaned samples for selective processing.
In practice
- Define Metaxy FeatureSpecs for data assets.
- Use @metaxify decorator for Dagster asset integration.
- Employ MetaxyIOManager for flexible I/O.
Topics
- Metaxy
- Sample-Level Versioning
- Multimodal Data Pipelines
- MLOps Orchestration
- Data Versioning
Code references
Best for: Computer Vision Engineer, MLOps Engineer, Machine Learning Engineer, Data Engineer
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by Dagster Blog.