Shift Left Your PySpark: How Unit Tests, Local Spark, and CI/CD Turn “It Runs” into “It Works”

· Source: Data Engineering on Medium · Field: Technology & Digital — Software Development & Engineering, Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, medium

Summary

The article advocates a "shift left" approach to PySpark development: catching defects with early-stage unit tests instead of relying on slow, expensive integration tests in production. It describes a common data engineering habit of testing code directly in Databricks notebooks, which leads to long feedback cycles and wasted cloud compute. The proposed solution is to refactor monolithic notebook code into small, pure Python functions that transform DataFrames, so that each function can be unit-tested in isolation. The key tools are `pytest` for test orchestration, `chispa` for expressive DataFrame assertions, and local Spark sessions for fast test runs that consume no cloud compute. The article also covers running these unit tests in CI/CD pipelines with Azure DevOps and deploying the tested code via Databricks Asset Bundles (DABs), so that buggy code is caught before it reaches production.

Key takeaway

Data engineers building PySpark pipelines should adopt a "shift left" testing strategy. Refactoring code into testable functions and integrating `pytest`, `chispa`, and local Spark into the CI/CD workflow reduces cloud spend, shortens feedback loops, and keeps defects out of production. It also builds trust in your data products and makes refactoring safer.

Key insights

Shift-left PySpark testing with unit tests and local Spark cuts costs and improves code quality.

Method

Refactor PySpark notebooks into pure Python functions. Use `pytest` with `chispa` for DataFrame assertions and a local Spark session for rapid, cost-free unit testing. Integrate tests into CI/CD via Azure DevOps and deploy with Databricks Asset Bundles.

Best for: Data Engineer, MLOps Engineer, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Data Engineering on Medium.