Scaling Graph Analytics Without ETL: Inside PuppyGraph’s Architecture
Summary
PuppyGraph, co-founded by Weimo Liu, offers a "zero-copy" graph query engine that runs Cypher and Gremlin traversals and graph algorithms directly on data in lakehouse table formats such as Iceberg, Delta, Hudi, and Hive, as well as in sources like MongoDB, eliminating the need for a separate graph store. Its edge-sharded, vectorized, Massively Parallel Processing (MPP) architecture addresses challenges such as hub nodes and multi-hop traversals, targeting workloads with sub-second to single-digit-second latency. The platform supports flexible graph data modeling over normalized or denormalized tables and logical views, and uses caching and Iceberg metadata for performance. PuppyGraph's operator-based engine unifies queries and algorithms, with applications in cybersecurity log analysis, entity resolution, and agentic workflows. The discussion also covers when embedded or transactional graph databases are a better fit.
Key takeaway
AI Architects and Data Engineers building agentic workflows or large-scale analytics can consider PuppyGraph to bring graph capabilities directly into the lakehouse. This approach avoids a separate ETL pipeline into a graph store, so graph traversals and algorithms run on existing Iceberg, Delta, or MongoDB data with sub-second to single-digit-second latency targets. You can simplify data pipelines and apply graph patterns to complex entity resolution or cybersecurity log analysis without migrating data to a dedicated graph store, which reduces operational overhead and shortens the time to insight.
Key insights
PuppyGraph enables direct graph querying on existing lakehouse data, bypassing ETL for scalable, real-time analytics.
Principles
- Shard graph data by edges, not nodes, to manage supernodes.
- Decouple computation from storage for flexible graph processing.
- Columnar storage is crucial for memory-efficient OLAP graph queries.
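The first principle can be illustrated with a toy comparison. The sketch below is not PuppyGraph's partitioning code; the modulo and round-robin placement rules and the function names are assumptions chosen only to show why a hub node overloads one shard when edges follow their source node, and why the load evens out when each edge is placed on its own.

```python
from collections import Counter


def shard_by_source_node(edges, num_shards):
    """Node-based placement: every edge lands on its source node's shard."""
    load = Counter()
    for src, _dst in edges:
        load[src % num_shards] += 1
    return [load[i] for i in range(num_shards)]


def shard_by_edge(edges, num_shards):
    """Edge-based placement: each edge is assigned independently (round-robin here)."""
    load = Counter()
    for i, _edge in enumerate(edges):
        load[i % num_shards] += 1
    return [load[i] for i in range(num_shards)]


# Toy graph (invented data): node 0 is a hub with 1,000 outgoing edges,
# and nodes 1..100 each have a single edge pointing back to the hub.
edges = [(0, dst) for dst in range(1, 1001)] + [(src, 0) for src in range(1, 101)]

node_load = shard_by_source_node(edges, 4)
edge_load = shard_by_edge(edges, 4)
print("node-sharded:", node_load)  # [1025, 25, 25, 25]
print("edge-sharded:", edge_load)  # [275, 275, 275, 275]
```

With node-based placement, the shard that owns the hub holds 1,025 of the 1,100 edges. With edge-based placement, every shard holds 275.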
Method
PuppyGraph uses an edge-sharded, vectorized, MPP architecture with node and edge operators. It leverages Iceberg metadata for adaptive caching and optimizes computation for sub-second to single-digit-second queries on large datasets.
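As a rough illustration of how a multi-hop traversal can run over edge shards, the sketch below expands a frontier one hop at a time, with each shard scanning its own edges as a batch. It is a simplified model under stated assumptions: the function names are hypothetical, the shards are scanned in a plain loop rather than in parallel, and no real vectorized execution or caching is shown.

```python
def expand_frontier(shard, frontier):
    """One shard's edge scan: return neighbors of any frontier node found in this shard."""
    return {dst for src, dst in shard if src in frontier}


def k_hop(shards, start, hops):
    """Return all nodes reachable from `start` in at most `hops` steps."""
    visited = {start}
    frontier = {start}
    for _ in range(hops):
        # An MPP engine would scan the shards concurrently; a loop stands in here.
        reached = set()
        for shard in shards:
            reached |= expand_frontier(shard, frontier)
        frontier = reached - visited  # keep only newly discovered nodes
        if not frontier:
            break
        visited |= frontier
    return visited


# Invented toy edge list, split across two shards by edge position.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "e"), ("e", "c"), ("d", "f")]
shards = [edges[i::2] for i in range(2)]

result = sorted(k_hop(shards, "a", 2))
print(result)  # ['a', 'b', 'c', 'e']
```

Because edges of the same node can sit on different shards, each hop needs a merge step across shards before the next expansion, which is the part a parallel engine has to coordinate.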
In practice
- Run Cypher/Gremlin traversals on data in Iceberg, Delta, Hudi, Hive, and MongoDB.
- Apply graph algorithms for cybersecurity log analysis and botnet detection.
- Utilize logical views for flexible graph schema mapping on existing tables.
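The idea of mapping existing tables to a graph through a logical view can be sketched as follows. The mapping structure, table contents, and function names here are all hypothetical and do not reflect PuppyGraph's actual schema format. The point is only that vertices and edges are described by pointing at rows that already exist, and a graph pattern is then evaluated over those rows without copying them into a graph store.

```python
# Rows as they might already exist in lakehouse tables (invented sample data).
users = [{"user_id": 1, "name": "ana"}, {"user_id": 2, "name": "bo"}]
logins = [
    {"user_id": 1, "host": "web-01"},
    {"user_id": 2, "host": "web-01"},
    {"user_id": 2, "host": "db-02"},
]

# Hypothetical logical mapping: which table and columns play which graph role.
mapping = {
    "vertices": {"User": {"rows": users, "id": "user_id"}},
    "edges": {"LOGGED_INTO": {"rows": logins, "src": "user_id", "dst": "host"}},
}


def iter_edges(mapping, label):
    """Yield (src, dst) pairs for an edge label by reading the mapped rows in place."""
    spec = mapping["edges"][label]
    for row in spec["rows"]:
        yield row[spec["src"]], row[spec["dst"]]


def vertex_props(mapping, label, vertex_id):
    """Look up the source row that backs a vertex."""
    spec = mapping["vertices"][label]
    return next(row for row in spec["rows"] if row[spec["id"]] == vertex_id)


def users_sharing_host_with(mapping, user_id):
    """Two-hop pattern: (user) -> (host) <- (other user)."""
    pairs = list(iter_edges(mapping, "LOGGED_INTO"))
    my_hosts = {host for uid, host in pairs if uid == user_id}
    others = {uid for uid, host in pairs if host in my_hosts and uid != user_id}
    return sorted(vertex_props(mapping, "User", uid)["name"] for uid in others)


shared = users_sharing_host_with(mapping, 1)
print(shared)  # ['bo']
```

Changing the graph model in this sketch means editing the mapping, not rewriting or reloading the underlying tables.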
Topics
- Graph Analytics
- Zero-Copy ETL
- Lakehouse Architecture
- Iceberg
- Cypher Query Language
- MPP Architecture
- Entity Resolution
Best for: CTO, VP of Engineering/Data, Director of AI/ML, Data Engineer, AI Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Data Engineering Podcast.