Skipper: Building Airbnb’s embedded workflow engine
Summary
Airbnb built Skipper, an embedded workflow engine that provides durable execution for critical "Tier 0" services without the operational overhead of an external orchestration cluster or a cloud-managed service. It addresses fragmented domain logic and bespoke retry systems by offering a shared library that runs inside existing services. Skipper reuses infrastructure the service already has, persisting state in a database such as MySQL or Airbnb's Unified Data Store, and exposes a simple Java/Kotlin programming model with annotation-based contracts. Durability comes from replay over checkpointed actions: each action's result is persisted, so after a crash or restart the workflow runs again and completed actions return their saved results instead of executing a second time. Compensation methods handle failures. Skipper has been in production for more than a year, powers more than 15 use cases across insurance, payments, and media processing, and has scaled to 10,000 workflows per second on Amazon DynamoDB.
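The replay mechanism described above can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not Skipper's actual API: `CheckpointStore`, `WorkflowContext`, `action`, and `processClaim` are hypothetical names, and an in-memory map stands in for the service's database.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

public class ReplaySketch {

    /** Hypothetical checkpoint store; an in-memory stand-in for a table in the service's database. */
    static final class CheckpointStore {
        private final Map<String, String> results = new HashMap<>();
        String get(String key) { return results.get(key); }
        void put(String key, String value) { results.put(key, value); }
    }

    /** Hypothetical context that runs an action once and replays its saved result afterwards. */
    static final class WorkflowContext {
        private final String workflowId;
        private final CheckpointStore store;
        private int step = 0;

        WorkflowContext(String workflowId, CheckpointStore store) {
            this.workflowId = workflowId;
            this.store = store;
        }

        String action(String name, Supplier<String> body) {
            // The key depends on call order, which is why workflow code must be deterministic.
            String key = workflowId + "/" + (step++) + "/" + name;
            String saved = store.get(key);
            if (saved != null) {
                return saved; // replay: skip the side effect, reuse the checkpoint
            }
            String result = body.get();
            store.put(key, result); // checkpoint only after the action succeeds
            return result;
        }
    }

    /** Workflow body: plain sequential code that mirrors the business steps. */
    static String processClaim(WorkflowContext ctx, List<String> effects, AtomicBoolean crashOnce) {
        String claim = ctx.action("validate", () -> {
            effects.add("validate");
            return "claim-42";
        });
        return ctx.action("pay", () -> {
            if (crashOnce.getAndSet(false)) {
                throw new IllegalStateException("simulated crash");
            }
            effects.add("pay");
            return "paid:" + claim;
        });
    }

    /** Runs the workflow, "crashes" in the second action, then replays from the same store. */
    static String runWithRestart(List<String> effects) {
        CheckpointStore store = new CheckpointStore();
        AtomicBoolean crashOnce = new AtomicBoolean(true);
        try {
            processClaim(new WorkflowContext("wf-1", store), effects, crashOnce);
        } catch (IllegalStateException e) {
            // The first attempt dies here; only the "validate" checkpoint survives.
        }
        return processClaim(new WorkflowContext("wf-1", store), effects, crashOnce);
    }

    public static void main(String[] args) {
        List<String> effects = new ArrayList<>();
        System.out.println(runWithRestart(effects)); // paid:claim-42
        System.out.println(effects);                 // [validate, pay]
    }
}
```

On the second run the "validate" action does not execute again; its checkpointed result is returned and the workflow continues from the step that had not yet completed.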
Key takeaway
If you build durable distributed systems and want to minimize external dependencies and operational overhead, consider an embedded workflow engine. In this model your service manages its own workflow processing, reuses its existing database, and avoids adding a single point of failure. You gain a familiar programming model, but you must keep workflows deterministic and actions idempotent: replay re-runs workflow code, and an action may execute more than once (at-least-once execution).
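One common way to make an action safe under at-least-once execution is an idempotency key derived from stable workflow identity. The sketch below is an assumption for illustration, not something the original article prescribes: `PaymentService`, `charge`, and `idempotencyKey` are hypothetical names, and the downstream service is assumed to deduplicate on the key.

```java
import java.util.HashMap;
import java.util.Map;

public class IdempotentActionSketch {

    /** Hypothetical downstream service that deduplicates requests on an idempotency key. */
    static final class PaymentService {
        private final Map<String, String> receiptsByKey = new HashMap<>();
        private int charges = 0;

        String charge(String idempotencyKey, long amountCents) {
            // A repeated key returns the original receipt instead of charging again.
            return receiptsByKey.computeIfAbsent(idempotencyKey, k -> {
                charges++;
                return "receipt-" + charges + "-" + amountCents;
            });
        }

        int chargeCount() { return charges; }
    }

    /** Build the key from stable identifiers, never from the clock or random values. */
    static String idempotencyKey(String workflowId, String actionName) {
        return workflowId + ":" + actionName;
    }

    public static void main(String[] args) {
        PaymentService payments = new PaymentService();
        String key = idempotencyKey("wf-1", "charge");
        String first = payments.charge(key, 1500);
        String retry = payments.charge(key, 1500); // same action executed a second time
        System.out.println(first.equals(retry));     // true
        System.out.println(payments.chargeCount());  // 1
    }
}
```

Because the key is derived only from the workflow and action identity, a repeated execution produces the same key and therefore the same outcome.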
Key insights
Embedded workflow engines offer durable execution with minimal overhead by utilizing existing service infrastructure.
Principles
- Workflow code should mirror business logic.
- Avoid central points of failure.
- Utilize existing service infrastructure.
Method
Skipper defines Workflows for orchestration logic and Actions for individual operations. Actions are checkpointed, and durability is achieved via replay, where previously executed actions return checkpointed results instantly. Compensation methods undo effects of failed actions.
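To make the compensation idea concrete, here is a minimal saga-style sketch in Java. It rests on assumptions the summary does not confirm: the `Action` interface, `named`, and `run` are hypothetical, and undoing previously completed actions in reverse order is a common convention rather than documented Skipper behavior.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class CompensationSketch {

    /** Hypothetical action contract: a forward step plus the method that undoes it. */
    interface Action {
        void execute(List<String> log);
        void compensate(List<String> log);
    }

    /** Builds a test action that records what it does and can be told to fail. */
    static Action named(String name, boolean fails) {
        return new Action() {
            @Override public void execute(List<String> log) {
                if (fails) {
                    throw new IllegalStateException(name + " failed");
                }
                log.add("do:" + name);
            }
            @Override public void compensate(List<String> log) {
                log.add("undo:" + name);
            }
        };
    }

    /** Runs actions in order; on failure, compensates the completed ones in reverse order. */
    static boolean run(List<Action> actions, List<String> log) {
        Deque<Action> completed = new ArrayDeque<>();
        for (Action action : actions) {
            try {
                action.execute(log);
                completed.push(action); // most recent action sits on top of the stack
            } catch (RuntimeException e) {
                while (!completed.isEmpty()) {
                    completed.pop().compensate(log);
                }
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        boolean ok = run(List.of(
                named("reserveFunds", false),
                named("notifyInsurer", false),
                named("issuePayout", true)), log);
        System.out.println(ok);  // false
        System.out.println(log); // [do:reserveFunds, do:notifyInsurer, undo:notifyInsurer, undo:reserveFunds]
    }
}
```

In a durable engine the compensation calls would themselves be checkpointed so that a crash during rollback can resume; the sketch leaves that out to keep the control flow visible.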
In practice
- Coordinate video processing pipelines.
- Manage Flink job lifecycles.
- Orchestrate multi-step claim processing.
Topics
- Workflow Engine
- Durable Execution
- Embedded Architecture
- Distributed Systems
- Java/Kotlin
- Idempotency
Best for: Software Engineer, DevOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by The Airbnb Tech Blog - Medium.