Scaling ML Inference on Databricks: Liquid or Partitioned? Salted or Not?
Summary
An analysis of ML inference scaling on Databricks for a pipeline predicting a continuous variable across four products, totaling ~550M rows, revealed significant performance bottlenecks due to data layout. An initial 420-core cluster spent nearly 10 hours processing only 18 partitions. The study compared four data treatment scenarios: non-salted partitioned, non-salted liquid-clustered, salted partitioned, and salted liquid-clustered tables. The non-salted approaches resulted in average partition sizes of tens of millions of rows, leading to severe task skew and incomplete jobs. Implementing a dynamic salting strategy, which distributes data based on product volume and enforces a 1M row limit per file, drastically improved parallelism. The salted partitioned approach reduced Product D's inference time to 3 hours with 860 partitions. The salted liquid-clustered approach further optimized performance by maintaining balanced task distribution and reducing maximum task duration, emerging as the most robust setup for scalable ML inference.
Key takeaway
For MLOps Engineers optimizing large-scale ML inference pipelines on Databricks, prioritize data layout strategies over solely increasing cluster size. Implement dynamic salting combined with liquid clustering and a "maxRecordsPerFile" limit (e.g., 1M rows) to maximize parallelism, reduce task skew, and ensure predictable, scalable inference runtimes, especially for imbalanced datasets. This approach can significantly cut processing times and operational costs.
Key insights
Data layout, specifically file partitioning and row limits, critically impacts ML inference scalability on Databricks.
Principles
- Inference scalability is often limited by data layout.
- Partitioning alone is insufficient for large-scale inference.
- Salting unlocks parallelism and stabilizes runtimes.
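A rough back-of-the-envelope check shows why layout, not cluster size, sets the ceiling. The sketch below assumes one task per partition and one core per task, and it assumes the 18 partitions covered the full ~550M rows; the summary does not state either explicitly. The function names are illustrative only.

```python
def max_core_utilization(num_partitions: int, num_cores: int) -> float:
    """Upper bound on the fraction of cores that can be busy at once,
    assuming one task per partition and one core per task."""
    return min(num_partitions, num_cores) / num_cores


def avg_rows_per_partition(total_rows: int, num_partitions: int) -> float:
    """Average partition size if rows were spread evenly (they usually are not)."""
    return total_rows / num_partitions


# Figures quoted in the summary: 420 cores, 18 partitions, ~550M rows.
unsalted_util = max_core_utilization(18, 420)
unsalted_avg = avg_rows_per_partition(550_000_000, 18)

# After salting, Product D alone had 860 partitions, more than the core count.
salted_util = max_core_utilization(860, 420)

print(f"unsalted: {unsalted_util:.1%} of cores busy at most, "
      f"~{unsalted_avg / 1e6:.1f}M rows per partition")
print(f"salted:   {salted_util:.0%} of cores can be busy")
```

Under these assumptions, fewer than one core in twenty can do useful work on the unsalted layout, which is consistent with the "tens of millions of rows" per partition reported in the summary.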
Method
The method generates a salt for each product based on that product's data volume, repartitions the data by this salt, and caps files at 1M rows during writes, so that inference work is split into many similarly sized tasks.
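As a minimal sketch of what "dynamic" salting could mean here: the number of salt values per product can be derived from its row count and the 1M-row cap. The formula (ceiling of rows divided by the cap), the round-robin assignment, and the product volumes below are assumptions for illustration; the summary does not give the article's exact rule or per-product counts.

```python
MAX_ROWS_PER_FILE = 1_000_000  # row cap per file, as stated in the summary


def salt_buckets(row_counts: dict, max_rows: int = MAX_ROWS_PER_FILE) -> dict:
    """Number of salt values per product: ceil(rows / max_rows), at least 1."""
    return {
        product: max(1, (rows + max_rows - 1) // max_rows)  # integer ceiling
        for product, rows in row_counts.items()
    }


def assign_salt(row_index: int, n_buckets: int) -> int:
    """Deterministic round-robin salt in [0, n_buckets) for a given row index."""
    return row_index % n_buckets


# Hypothetical volumes, NOT taken from the article.
row_counts = {"A": 2_500_000, "B": 40_000_000, "C": 900_000, "D": 300_000_000}
buckets = salt_buckets(row_counts)
print(buckets)  # {'A': 3, 'B': 40, 'C': 1, 'D': 300}
```

Sizing the salt range per product keeps small products in a single group while large ones are split into many, which is what lets the task count track data volume rather than the number of products.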
In practice
- Use dynamic salting to distribute data according to volume.
- Enforce a "maxRecordsPerFile" option during data writes.
- Combine salting with liquid clustering for adaptive file layout.
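The three practices above might be combined in PySpark roughly as follows. This is a sketch under stated assumptions, not the article's code: the column names `product` and `salt`, the helper names, the caller-supplied `schema_ddl`, and the choice of clustering columns are all hypothetical. It relies on the Delta `CLUSTER BY` table clause for liquid clustering and the `maxRecordsPerFile` writer option.

```python
def cluster_by_ddl(table: str, schema_ddl: str, cluster_cols: list) -> str:
    """Build a CREATE TABLE statement that enables liquid clustering."""
    cols = ", ".join(cluster_cols)
    return f"CREATE TABLE IF NOT EXISTS {table} ({schema_ddl}) CLUSTER BY ({cols})"


def write_salted(spark, df, buckets, table, schema_ddl, max_rows=1_000_000):
    """Salt `df` per product and append it to a liquid-clustered Delta table.

    `buckets` is a dict {product: number_of_salt_values}; `schema_ddl` must
    describe the salted schema, including a `salt INT` column.
    """
    # Imported lazily so the pure helper above works without Spark installed.
    from pyspark.sql import functions as F

    # Literal map column {product -> salt count}, looked up per row.
    # Products missing from `buckets` get a null salt.
    pairs = [F.lit(x) for item in buckets.items() for x in item]
    n_salts = F.create_map(*pairs)[F.col("product")]

    # Random salt in [0, n_salts): larger products spread over more groups.
    salted = df.withColumn("salt", F.floor(F.rand() * n_salts).cast("int"))

    # Create the clustered table once, then append with a per-file row cap.
    spark.sql(cluster_by_ddl(table, schema_ddl, ["product", "salt"]))
    (
        salted.repartition("product", "salt")
        .write.format("delta")
        .mode("append")
        .option("maxRecordsPerFile", max_rows)
        .saveAsTable(table)
    )
```

For the salted partitioned variant, the table would instead be written with `.partitionBy("product", "salt")` and no `CLUSTER BY` clause; a Delta table cannot use both layouts at once.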
Topics
- ML Inference Scaling
- Databricks Optimization
- Data Partitioning
- Liquid Clustering
- Data Salting
Best for: Machine Learning Engineer, MLOps Engineer, Data Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.