Between Scripts and Frameworks: Escaping the Complexity Trap in ML Pipelines
Summary
A data scientist's reflection on machine learning pipeline design highlights the challenges of complexity in growing projects, where tightly coupled objects and hidden dependencies hinder debugging and reproducibility. The author initially considered Kedro, an open-source Python framework, for its emphasis on explicit data dependencies and reproducible execution, but faced internal skepticism regarding framework overhead. This led to the development of `linepipe`, a minimal, dependency-free Python library inspired by Kedro's core principles. `linepipe` focuses on linear execution of pure functions, explicit inputs/outputs, and optional caching, aiming to solve complexity issues with minimal adoption friction. While `linepipe` found use in data engineering workflows for deterministic extraction and transformation, it did not replace more comprehensive frameworks like Kedro for complex ML projects requiring features like parallelization or versioning.
Key takeaway
If you build or maintain ML pipelines, recognize that hidden dependencies and implicit execution order are major sources of complexity. Prioritize tools and practices that make data flow and function execution explicit, even if that means building a lightweight internal solution like `linepipe` to avoid the overhead of a full framework. For simpler, linear workflows, this approach can make debugging easier, iteration faster, and reproducibility easier to trust.
Key insights
Complexity, not framework absence, is the primary pain point in growing ML projects.
Principles
- Prioritize explicit data dependencies.
- Ensure reproducible execution.
- Separate function logic from data flow.
Method
`linepipe` runs a linear pipeline of functions whose inputs and outputs are referred to by string names. It resolves those names through an internal object registry and supports optional caching for faster iteration and debugging.
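The idea can be illustrated with a minimal sketch in plain Python. This is not `linepipe`'s actual API, which the summary does not show; the `run_linear` function, the `(function, input names, output name)` step format, and the example functions are all hypothetical and exist only to show how string-named inputs, a registry, and an optional cache fit together.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

# Hypothetical step format: (function, input names, output name).
Step = Tuple[Callable[..., Any], List[str], str]


def run_linear(
    steps: List[Step],
    registry: Dict[str, Any],
    cache: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Run steps in order, resolving named inputs from the registry."""
    data = dict(registry)  # copy so the caller's registry is not mutated
    for func, input_names, output_name in steps:
        if cache is not None and output_name in cache:
            data[output_name] = cache[output_name]  # reuse an earlier result
            continue
        missing = [name for name in input_names if name not in data]
        if missing:
            raise KeyError(f"{func.__name__} needs undefined inputs: {missing}")
        result = func(*(data[name] for name in input_names))
        data[output_name] = result
        if cache is not None:
            cache[output_name] = result
    return data


# Pure functions: no hidden state, everything arrives as an argument.
def drop_negatives(values):
    return [v for v in values if v >= 0]


def scale(values, factor):
    return [v * factor for v in values]


steps = [
    (drop_negatives, ["raw"], "clean"),
    (scale, ["clean", "factor"], "scaled"),
]

cache: Dict[str, Any] = {}
result = run_linear(steps, {"raw": [3, -1, 2], "factor": 10}, cache=cache)
print(result["scaled"])  # [30, 20]
```

Because each step names its inputs, a missing dependency fails immediately with a clear error instead of surfacing later as a hidden coupling. The cache here is keyed by output name only and never invalidates when inputs change, so it is a simplification suited to illustration rather than production use.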
In practice
- Use `linepipe` for deterministic data engineering.
- Consider `linepipe` for simple ML experiments.
- Combine `linepipe` with `pydantic` for config.
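As a rough illustration of the last point, configuration can be validated with `pydantic` before any step runs, then handed to a pure function as an ordinary argument. The `SplitConfig` model, its fields, and the `split` function below are hypothetical examples, not taken from the original article, and the sketch does not use `linepipe` itself.

```python
from pydantic import BaseModel, Field, ValidationError


class SplitConfig(BaseModel):
    """Hypothetical settings for a train/test split step."""

    test_fraction: float = Field(gt=0, lt=1)  # must lie strictly between 0 and 1
    limit: int = Field(default=100, gt=0)     # maximum number of rows to use


def split(rows, config):
    """Pure function: the result depends only on the arguments."""
    rows = rows[: config.limit]
    cut = len(rows) - int(len(rows) * config.test_fraction)
    return rows[:cut], rows[cut:]


config = SplitConfig(test_fraction=0.25)
train, test = split(list(range(8)), config)
print(train, test)  # [0, 1, 2, 3, 4, 5] [6, 7]

# Invalid settings are rejected before any pipeline step runs.
try:
    SplitConfig(test_fraction=1.5)
except ValidationError:
    print("rejected invalid config")
```

A validated config object like this could then be registered under a name and passed to steps the same way as any other input, which keeps configuration explicit rather than read from globals.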
Topics
- ML Pipelines
- Pipeline Orchestration
- Kedro Framework
- Data Engineering
- linepipe
Best for: AI Engineer, Data Scientist, Machine Learning Engineer, Data Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning on Medium.