Processing 1 TB with DuckDB in less than 30 seconds

2024-04-11 · Source: DataExpert.io Newsletter · Field: Technology & Digital — Data Science & Analytics, Software Development & Engineering, Cloud Computing & IT Infrastructure · Depth: Intermediate, medium

Summary

This article challenges the conventional wisdom that DuckDB is only suitable for "small" datasets by showing that it can process a terabyte of data efficiently. The author generated a 1 TB dataset with Python's ProcessPoolExecutor, writing 400 Parquet files of approximately 2.76 GB each in parallel. Run locally on a Mac M2 Pro with 16 GB of RAM, a common aggregation query over the full dataset averaged 1 minute and 29 seconds. The same dataset on MotherDuck with "Mega" compute capacity averaged under 17 seconds per query. Pre-sorting the data so that DuckDB's zonemaps (per-block min/max indexes) could skip irrelevant blocks brought the average to under 10 seconds, which the article reports as a 30% improvement.

Key takeaway

For Data Engineers evaluating distributed compute solutions, DuckDB and MotherDuck offer a compelling alternative to Spark for terabyte-scale data processing. Benchmark DuckDB on your larger datasets, both locally and on a cloud deployment, and pre-sort the data on the columns you filter by so that DuckDB's zonemaps can skip irrelevant blocks. In the article's tests this brought aggregation queries over 1 TB to under 30 seconds, which can significantly reduce infrastructure cost and complexity.

Key insights

DuckDB can efficiently process terabyte-scale datasets, challenging its "small data" perception.

Principles

DuckDB scales beyond 20GB datasets.
Cloud platforms enhance DuckDB performance.
Data sorting improves query speed via Zonemap.
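
To make the third principle concrete, here is a minimal pure-Python sketch of the idea behind a zonemap. It is an illustration, not DuckDB's implementation: the `RowGroup` class, the group size of 100, and the filter range are all hypothetical. Each group of rows stores its minimum and maximum value, and a range filter only has to read the groups whose range overlaps the filter.

```python
import random
from dataclasses import dataclass


@dataclass
class RowGroup:
    """A block of rows plus the min/max metadata a zonemap would keep."""
    values: list[int]

    @property
    def min_value(self) -> int:
        return min(self.values)

    @property
    def max_value(self) -> int:
        return max(self.values)


def build_row_groups(values: list[int], group_size: int) -> list[RowGroup]:
    """Split values into consecutive groups of at most group_size rows."""
    return [
        RowGroup(values[i:i + group_size])
        for i in range(0, len(values), group_size)
    ]


def groups_to_scan(groups: list[RowGroup], low: int, high: int) -> int:
    """Count groups whose [min, max] range overlaps the filter [low, high]."""
    return sum(1 for g in groups if g.max_value >= low and g.min_value <= high)


# Hypothetical data: the integers 0..999, once shuffled and once sorted.
shuffled = list(range(1000))
random.Random(0).shuffle(shuffled)

unsorted_groups = build_row_groups(shuffled, group_size=100)
sorted_groups = build_row_groups(sorted(shuffled), group_size=100)

# Filter: 200 <= value <= 299. With sorted data only one group overlaps it;
# with shuffled data nearly every group's min/max range spans the filter.
scanned_unsorted = groups_to_scan(unsorted_groups, 200, 299)
scanned_sorted = groups_to_scan(sorted_groups, 200, 299)
print(f"groups scanned: unsorted={scanned_unsorted}, sorted={scanned_sorted}")
```

The data is identical in both cases. Only the physical order changes, and that is what lets the min/max metadata rule out most groups.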

Method

Generate 1TB Parquet data in parallel using Python's ProcessPoolExecutor, then benchmark aggregation queries on local DuckDB and MotherDuck, comparing unsorted and Zonemap-indexed data.
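
The generation step could look roughly like the sketch below. It is not the author's script: the schema (`id`, `category`, `amount`, `txn_date`), the file naming, and the use of DuckDB's `COPY ... TO` as the Parquet writer are assumptions made for illustration. It requires the third-party `duckdb` package.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def plan_files(out_dir: str, n_files: int) -> list[str]:
    """Return one target path per Parquet file (one task per file)."""
    return [str(Path(out_dir) / f"part_{i:04d}.parquet") for i in range(n_files)]


def write_one_file(path: str, rows: int) -> str:
    """Write a single Parquet file of synthetic rows (hypothetical schema)."""
    import duckdb  # third-party; imported in the worker process

    con = duckdb.connect()  # private in-memory database for this worker
    try:
        con.execute(
            f"""
            COPY (
                SELECT
                    i AS id,
                    'cat_' || CAST(i % 100 AS VARCHAR) AS category,
                    random() * 1000 AS amount,
                    DATE '2024-01-01' + CAST(i % 365 AS INTEGER) AS txn_date
                FROM range({int(rows)}) AS t(i)
            ) TO '{path}' (FORMAT PARQUET)
            """
        )
    finally:
        con.close()
    return path


def generate(out_dir: str, n_files: int, rows_per_file: int, workers: int = 4) -> list[str]:
    """Write n_files Parquet files in parallel, one file per task."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    paths = plan_files(out_dir, n_files)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(write_one_file, paths, [rows_per_file] * n_files))


# Example call (run it under an `if __name__ == "__main__":` guard, which
# ProcessPoolExecutor needs on platforms that spawn worker processes):
#     generate("data", n_files=8, rows_per_file=100_000)
```

Each worker writes its own file, so workers share no state and the job scales with the number of cores. Reaching the article's 400 files of about 2.76 GB each would mean raising `n_files` and `rows_per_file` accordingly.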

In practice

Don't rule out DuckDB for datasets larger than 20GB.
Consider MotherDuck for TB-scale analytics.
Sort data by the columns your queries filter on for performance gains.
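
The last point can be tried out with a sketch like the following, which loads Parquet files into a DuckDB table in sorted order and times an aggregation that filters on the sort column. The table name `txns_sorted`, the columns `category`, `amount`, and `txn_date`, and the query itself are hypothetical stand-ins, not the article's benchmark. It requires the third-party `duckdb` package.

```python
import time
from statistics import mean
from typing import Callable


def average_seconds(run: Callable[[], object], repeats: int = 3) -> float:
    """Call run() several times and return the mean wall-clock seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        timings.append(time.perf_counter() - start)
    return mean(timings)


def sorted_table_sql(table: str, parquet_glob: str, sort_column: str) -> str:
    """Build a statement that loads Parquet files into a table in sorted order."""
    return (
        f"CREATE OR REPLACE TABLE {table} AS "
        f"SELECT * FROM read_parquet('{parquet_glob}') ORDER BY {sort_column}"
    )


def benchmark_sorted(db_path: str, parquet_glob: str) -> float:
    """Load data sorted by txn_date, then time a filtered aggregation."""
    import duckdb  # third-party: pip install duckdb

    con = duckdb.connect(db_path)
    try:
        con.execute(sorted_table_sql("txns_sorted", parquet_glob, "txn_date"))
        # Hypothetical query: the filter is on the sort column, so blocks
        # whose min/max dates fall outside March can be skipped.
        query = """
            SELECT category, sum(amount) AS total
            FROM txns_sorted
            WHERE txn_date BETWEEN DATE '2024-03-01' AND DATE '2024-03-31'
            GROUP BY category
        """
        return average_seconds(lambda: con.execute(query).fetchall())
    finally:
        con.close()
```

For example, `benchmark_sorted("bench.duckdb", "data/*.parquet")` would run this against a local database file. Running the same query against a table loaded without the `ORDER BY` gives the unsorted baseline for comparison.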

Topics

DuckDB Performance
Terabyte Data Processing
MotherDuck
Query Optimization
Distributed Compute Benchmarking

Code references

mattmartin14/dream_machine

Best for: Data Engineer, Data Scientist, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by DataExpert.io Newsletter.