Your Data, Your Lake: How Observe Uses Iceberg and Streaming ETL for Observability
Summary
Observe cofounder and CTO Jacob Leverich discusses applying lakehouse architectures to observability workloads, emphasizing cloud-native warehousing and open table formats like Iceberg for scalability and cost efficiency. He highlights how this approach, combined with streaming ingest via OpenTelemetry, Kafka-backed durability, curated/columnarized tables, and query orchestration, addresses common pain points such as fragmented tools, high costs, and data silos. The system delivers low-latency, interactive troubleshooting across logs, metrics, and traces at petabyte scale. Leverich also covers the practicalities of organizing telemetry by use case to minimize read amplification, and explains why Iceberg v3's JSON shredding capabilities matter for a "your data in your lake" strategy.
Key takeaway
CTOs and AI architects evaluating observability solutions should consider lakehouse architectures such as Observe's. The approach centralizes diverse telemetry data, reduces costs, and speeds up troubleshooting by combining open table formats with streaming ETL. Teams gain unified access to petabytes of data, which can improve MTTR and support AI-driven analytics without the typical constraints of fragmented, expensive legacy systems.
Key insights
Lakehouse architectures can provide scalable, cost-efficient observability by centralizing diverse telemetry data.
Principles
- Organize data by use case to minimize read amplification.
- Streaming ETL is crucial for low-latency observability.
- Open table formats enable data ownership and multi-tool access.
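The first principle can be made concrete with a little arithmetic. The sketch below is illustrative only: the record sizes and use-case names are made up, and "read amplification" is taken here to mean bytes scanned divided by bytes the query actually needs.

```python
from collections import defaultdict

# Hypothetical telemetry records: (use_case, size_in_bytes). Values are invented.
RECORDS = [
    ("vpc_flow", 200),
    ("app_log", 1000),
    ("vpc_flow", 200),
    ("k8s_event", 600),
    ("app_log", 1000),
    ("vpc_flow", 200),
]


def read_amplification(bytes_scanned: int, bytes_needed: int) -> float:
    """Ratio of bytes a query must scan to bytes it actually needs."""
    return bytes_scanned / bytes_needed


def bytes_by_use_case(records):
    """Total bytes stored per use case, as if each had its own table."""
    totals = defaultdict(int)
    for use_case, size in records:
        totals[use_case] += size
    return dict(totals)


curated = bytes_by_use_case(RECORDS)
needed = curated["vpc_flow"]                   # bytes a flow-log query needs
mixed_scan = sum(size for _, size in RECORDS)  # one mixed table: scan everything

mixed_amp = read_amplification(mixed_scan, needed)
curated_amp = read_amplification(curated["vpc_flow"], needed)
print(f"mixed table: {mixed_amp:.2f}x, curated table: {curated_amp:.2f}x")
```

In this toy data set, a query over flow logs scans more than five times the bytes it needs when everything lives in one mixed table, and exactly the bytes it needs when flow logs have their own table.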
Method
Ingest OpenTelemetry data via Kafka for durability and batching, then stream-process it into curated, columnarized Iceberg tables. A query orchestration layer translates each request into an optimized sequence of SQL queries to keep performance interactive.
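The batching and columnarizing steps can be sketched in a few lines. This is a minimal illustration, not Observe's pipeline: an in-memory list stands in for a Kafka topic, the record fields are hypothetical, and a real loader would also flush on a time limit and write the result to a table format rather than a Python dict.

```python
from typing import Dict, Iterable, Iterator, List

Record = Dict[str, object]
ColumnBatch = Dict[str, List[object]]


def micro_batches(stream: Iterable[Record], max_rows: int) -> Iterator[List[Record]]:
    """Group an incoming record stream into batches of at most max_rows."""
    batch: List[Record] = []
    for record in stream:
        batch.append(record)
        if len(batch) >= max_rows:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch


def columnarize(rows: List[Record], columns: List[str]) -> ColumnBatch:
    """Pivot row-oriented records into one list per column; missing fields become None."""
    return {name: [row.get(name) for row in rows] for name in columns}


# Hypothetical span-like records standing in for messages read from a topic.
stream = [
    {"ts": 1, "service": "api", "duration_ms": 12},
    {"ts": 2, "service": "db", "duration_ms": 40},
    {"ts": 3, "service": "api"},
]

COLUMNS = ["ts", "service", "duration_ms"]
batches = [columnarize(b, COLUMNS) for b in micro_batches(stream, max_rows=2)]
for batch in batches:
    print(batch)
```

Batching amortizes the per-write cost of loading into the lakehouse, and the column-per-list layout mirrors how columnar files let a query read only the fields it touches.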
In practice
- Deploy OpenTelemetry collectors for vendor-neutral data collection.
- Use Kafka for buffering and efficient batch loading into lakehouses.
- Curate data into specific tables (e.g., VPC flow logs) to optimize queries.
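As a rough illustration of the curation step, the sketch below parses lines in the default (version 2) AWS VPC flow log format into typed rows and routes anything that does not match to a catch-all table. The table names `vpc_flow_logs` and `raw_logs` and the routing logic are assumptions made for this example, not details from the episode.

```python
from typing import Dict, List, Optional

# Field order of the default (version 2) AWS VPC flow log format.
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]
INT_FIELDS = {"version", "srcport", "dstport", "protocol",
              "packets", "bytes", "start", "end"}


def parse_flow_log(line: str) -> Optional[Dict[str, object]]:
    """Parse one space-delimited flow log line; return None if it does not match."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        return None
    row: Dict[str, object] = {}
    for name, value in zip(FIELDS, parts):
        if name in INT_FIELDS:
            if value == "-":  # "-" marks a field with no data
                row[name] = None
            elif value.isdigit():
                row[name] = int(value)
            else:
                return None
        else:
            row[name] = value
    return row


def route(lines: List[str]) -> Dict[str, list]:
    """Send parseable lines to a curated table and everything else to a raw table."""
    tables: Dict[str, list] = {"vpc_flow_logs": [], "raw_logs": []}
    for line in lines:
        row = parse_flow_log(line)
        if row is None:
            tables["raw_logs"].append(line)
        else:
            tables["vpc_flow_logs"].append(row)
    return tables


tables = route([
    "2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 "
    "20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK",
    "GET /healthz 200 3ms",
])
print(tables["vpc_flow_logs"][0]["dstport"], len(tables["raw_logs"]))
```

Once flow logs sit in their own typed table, a question such as "which sources hit port 22" becomes a scan over a few narrow columns instead of a text search across every log line.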
Topics
- Observability
- Lakehouse Architecture
- Apache Iceberg
- Streaming ETL
- OpenTelemetry
Best for: CTO, VP of Engineering/Data, AI Architect, Data Engineer, MLOps Engineer, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Data Engineering Podcast.