14,000 Files, 40GB: Fixing a Slow Delta Table in Azure Synapse
Summary
A 40GB Delta table in Azure Synapse had degraded badly: Spark queries spent almost a minute on startup before any data was read. The root cause was an excessive number of small Parquet files, around 14,000 in total and many under 2MB, accumulated over months of hourly merge jobs. With that many files, Spark spent significant time processing the Delta transaction log and building a query plan across them. The article details how Delta Lake maintenance commands ("OPTIMIZE", optionally with "ZORDER BY", and "VACUUM"), combined with a deliberate partitioning strategy, resolved the bottleneck by consolidating the fragmented files.
Key takeaway
For Data Engineers managing Delta Lake tables in Azure Synapse, monitoring file counts is the way to catch this problem before it becomes a bottleneck. If Spark queries on a Delta table are slow to start, check how fragmented the underlying files are. Run "OPTIMIZE" regularly, with "ZORDER BY" on key filter columns where it helps, to consolidate small files and keep query planning cheap, and run "VACUUM" to remove the files that compaction leaves behind.
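As a rough illustration of what monitoring file counts can look like, the sketch below turns a file count and total size into an average file size and a target file count. It assumes the figures come from Delta's "DESCRIBE DETAIL" output (columns numFiles and sizeInBytes); the function names and the 128MB target size are illustrative assumptions, not values from the article.

```python
def fragmentation_report(num_files: int, size_in_bytes: int,
                         target_file_mb: int = 128) -> dict:
    """Summarise fragmentation from a table's file count and total size.

    target_file_mb is an assumed target, chosen only for illustration.
    """
    if num_files <= 0:
        raise ValueError("num_files must be positive")
    mib = 1024 * 1024
    target_bytes = target_file_mb * mib
    avg_file_mb = size_in_bytes / num_files / mib
    # Ceiling division: files needed if every file were at the target size.
    ideal_files = max(1, -(-size_in_bytes // target_bytes))
    return {
        "avg_file_mb": round(avg_file_mb, 2),
        "ideal_files": ideal_files,
        "excess_ratio": num_files / ideal_files,
    }


def table_detail(spark, table: str) -> tuple:
    """Read (numFiles, sizeInBytes) for a Delta table via DESCRIBE DETAIL.

    `spark` is an existing SparkSession; `table` is a hypothetical name.
    """
    row = spark.sql(f"DESCRIBE DETAIL {table}").first()
    return row["numFiles"], row["sizeInBytes"]


# Figures from the article: about 14,000 files holding 40GB.
report = fragmentation_report(num_files=14_000, size_in_bytes=40 * 1024**3)
print(report)
```

On the article's numbers the average file comes out at just under 3MB, and the same data would fit in a few hundred files at the assumed target size. In a Synapse notebook, the inputs could be fetched with table_detail(spark, "my_db.my_table"), where the table name is a placeholder.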
Key insights
Excessive small files in Delta tables significantly degrade Spark query performance by increasing transaction log processing time.
Principles
- Delta tables require regular file compaction.
- Small files increase query planning overhead.
- Transaction logs impact query startup time.
Method
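Identify Delta table queries that are slow to start, check the table's file count, then run "OPTIMIZE" (with "ZORDER BY" on the columns queries filter on) to consolidate small files, followed by "VACUUM" to delete data files that the table no longer references once the retention period has passed.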
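The method above can be sketched as a small helper that builds and runs the two maintenance statements. This is a minimal sketch, assuming a Spark pool whose Delta Lake version supports the "OPTIMIZE ... ZORDER BY" SQL syntax; the helper names, table name, and column name are hypothetical. The 168-hour retention matches Delta Lake's default of 7 days, and the guard mirrors Delta's default safety check rather than replacing it.

```python
def maintenance_statements(table: str, zorder_cols=(), retain_hours: int = 168) -> list:
    """Build the OPTIMIZE and VACUUM statements for one Delta table."""
    if retain_hours < 168:
        # Delta rejects shorter retention by default, since it can delete
        # files that running readers or time travel still need.
        raise ValueError("retain_hours below 168 is unsafe by default")
    optimize = f"OPTIMIZE {table}"
    if zorder_cols:
        # ZORDER BY co-locates rows with similar values in the same files.
        optimize += " ZORDER BY (" + ", ".join(zorder_cols) + ")"
    vacuum = f"VACUUM {table} RETAIN {retain_hours} HOURS"
    # Compaction first, so VACUUM can later remove the superseded small files.
    return [optimize, vacuum]


def run_maintenance(spark, table: str, zorder_cols=(), retain_hours: int = 168) -> list:
    """Execute the statements on an existing SparkSession and return them."""
    statements = maintenance_statements(table, zorder_cols, retain_hours)
    for stmt in statements:
        spark.sql(stmt)
    return statements


# Hypothetical table and column names, for illustration only.
for stmt in maintenance_statements("sales.events", ["customer_id"]):
    print(stmt)
```

Note that the small files replaced by "OPTIMIZE" stay on storage until they are older than the retention window, so a "VACUUM" run immediately after compaction does not reclaim them yet.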
In practice
- Run "OPTIMIZE" on frequently updated tables.
- Use "ZORDER BY" on high-cardinality columns that queries filter on.
- Schedule "VACUUM" to remove data files the table no longer references.
Topics
- Azure Synapse
- Delta Lake
- Apache Spark
- Data Optimization
- File Compaction
- Data Lakes
Best for: Data Engineer, Analytics Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Data Engineering on Medium.