Shift Left Your PySpark: How Unit Tests, Local Spark, and CI/CD Turn “It Runs” into “It Works”
Summary
The article advocates a "shift left" approach to PySpark development, replacing slow, expensive integration testing in production with early-stage unit testing. It describes a common data engineering practice in which code is tested directly in Databricks notebooks, leading to long feedback cycles and wasted cloud compute. The proposed solution is to refactor monolithic notebook code into small, pure Python functions that transform DataFrames, so each one can be unit tested in isolation. The key tools are `pytest` for test orchestration, `chispa` for expressive DataFrame assertions, and local Spark sessions for fast test runs that use no cloud compute. The article also covers running these unit tests in CI/CD pipelines on Azure DevOps and deploying the tested code with Databricks Asset Bundles (DABs), so that buggy code is caught before it reaches production and data pipelines are held to disciplined software engineering practices.
Key takeaway
For Data Engineers building PySpark pipelines, adopting a "shift left" testing strategy is crucial. By refactoring code into testable functions and integrating `pytest`, `chispa`, and local Spark into your CI/CD workflow, you can significantly reduce cloud spend, accelerate feedback loops, and prevent defects from reaching production. This approach builds trust in your data products and allows for safer, more confident refactoring.
Key insights
Shifting PySpark testing left with unit tests and a local Spark session cuts cloud costs and improves code quality.
Principles
- Decouple business logic from I/O for testability.
- Automate testing in CI/CD to prevent production bugs.
Method
Refactor PySpark notebooks into pure Python functions. Use `pytest` with `chispa` for DataFrame assertions and a local Spark session for rapid, cost-free unit testing. Integrate tests into CI/CD via Azure DevOps and deploy with Databricks Asset Bundles.
In practice
- Use `pytest` for Python test discovery and execution.
- Employ `chispa` for PySpark DataFrame comparisons.
- Run tests with a local Spark session to save cloud costs.
Topics
- PySpark Unit Testing
- Data Engineering
- CI/CD Pipelines
- Databricks Asset Bundles
- Local Spark Testing
Best for: Data Engineer, MLOps Engineer, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Data Engineering on Medium.