Scaling ML Inference on Databricks: Liquid or Partitioned? Salted or Not?

· Source: Towards Data Science · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Cloud Computing & IT Infrastructure · Depth: Advanced, long

Summary

An analysis of ML inference scaling on Databricks, for a pipeline that predicts a continuous variable across four products (~550M rows in total), traced severe performance bottlenecks to data layout. An initial 420-core cluster spent nearly 10 hours processing only 18 partitions. The study compared four data layouts: non-salted partitioned, non-salted liquid-clustered, salted partitioned, and salted liquid-clustered tables. The non-salted layouts produced partitions averaging tens of millions of rows, which caused severe task skew and jobs that did not complete. A dynamic salting strategy, which distributes data according to each product's volume and caps every file at 1M rows, drastically improved parallelism. The salted partitioned layout cut Product D's inference time to 3 hours across 860 partitions. The salted liquid-clustered layout improved on this by keeping tasks evenly sized and lowering the maximum task duration, making it the most robust of the four setups for scalable ML inference.
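The scale of the mismatch can be checked from the figures quoted above. The following is a back-of-the-envelope sketch, not the article's own calculation; it assumes one Spark task per partition and one core per task, and the helper name `layout_stats` is made up for illustration.

```python
import math


def layout_stats(total_rows, partitions, cores, max_rows_per_file):
    """Rough parallelism figures for a given data layout.

    Assumes one task per partition and one core per task.
    """
    avg_rows = total_rows / partitions                      # average rows per partition
    busy_fraction = min(partitions, cores) / cores          # share of cores that can work at once
    min_files = math.ceil(total_rows / max_rows_per_file)   # lower bound on files under the row cap
    return avg_rows, busy_fraction, min_files


# Figures from the summary: ~550M rows, 18 partitions, 420 cores, 1M-row cap.
avg_rows, busy_fraction, min_files = layout_stats(550_000_000, 18, 420, 1_000_000)

print(f"average rows per partition: {avg_rows / 1e6:.1f}M")
print(f"cores that can be busy at once: {busy_fraction:.1%}")
print(f"minimum files under the 1M-row cap: {min_files}")
```

Under these assumptions, no more than 18 of the 420 cores can be working at any moment, so adding cores alone cannot shorten the run, while a 1M-row cap implies at least 550 files to spread across the cluster.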

Key takeaway

For MLOps engineers optimizing large-scale ML inference pipelines on Databricks: prioritize data layout over simply adding cluster capacity. Combine dynamic salting with liquid clustering and a `maxRecordsPerFile` limit (e.g., 1M rows) to maximize parallelism, reduce task skew, and keep inference runtimes predictable as data grows, especially for imbalanced datasets. This approach can significantly cut processing times and operational costs.

Key insights

Data layout, specifically file partitioning and row limits, critically impacts ML inference scalability on Databricks.

Principles

Method

The method generates a salt dynamically from each product's data volume, repartitions the data on that salt, and caps every written file at 1M rows so that inference work spreads evenly across tasks.
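A minimal sketch of how such a salt could be derived and applied, assuming the salt count per product is simply the row count divided by the 1M-row cap, rounded up. The column names `product` and `salt`, the function names, the output path, and the example volumes are all hypothetical; the source does not give the author's code. The PySpark part imports `pyspark` lazily so the sizing helper can be used on its own.

```python
import math

MAX_ROWS_PER_FILE = 1_000_000  # the 1M-row cap named in the summary


def salt_buckets(row_counts, max_rows=MAX_ROWS_PER_FILE):
    """Salt values per product: enough buckets that each holds at most max_rows on average."""
    return {p: max(1, math.ceil(n / max_rows)) for p, n in row_counts.items()}


def write_salted(df, buckets, path):
    """PySpark sketch: add a per-product random salt, repartition on it, cap rows per file.

    Assumes `df` has a `product` column (hypothetical name) and that
    `buckets` comes from salt_buckets().
    """
    from pyspark.sql import functions as F

    # Literal map column: product -> number of salt buckets for that product.
    bucket_map = F.create_map(*[F.lit(x) for kv in buckets.items() for x in kv])

    # Random integer in [0, buckets - 1], so large products get more salt values.
    salted = df.withColumn(
        "salt", (F.rand() * bucket_map[F.col("product")]).cast("int")
    )

    (
        salted.repartition("product", "salt")
        .write.format("delta")
        .option("maxRecordsPerFile", MAX_ROWS_PER_FILE)  # hard cap on rows per file
        .partitionBy("product", "salt")
        .mode("overwrite")
        .save(path)
    )


# Hypothetical volumes, chosen only to illustrate an imbalanced dataset.
example = salt_buckets({"A": 500_000, "B": 2_500_000, "D": 300_000_000})
print(example)
```

Because the salt is random, bucket sizes are only balanced on average; the `maxRecordsPerFile` writer option is what enforces the per-file limit. For the liquid-clustered variant, the table would be clustered on these columns rather than partitioned by them.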

Topics

Best for: Machine Learning Engineer, MLOps Engineer, Data Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.