Slack Eliminates SSH in EMR Pipelines, Migrates 700+ Jobs to REST-Based Architecture
Summary
Slack modernized its data platform by replacing SSH-based job execution with a REST-driven orchestration layer across its Amazon EMR pipelines. The initiative, completed over three quarters without downtime, migrated more than 700 Airflow operators to a centralized job submission system, with the goal of improving security, reliability, and observability across eight data regions. Previously, Airflow operators opened direct SSH connections to EMR master nodes, which widened the attack surface, added operational overhead for key management, and made consistent auditing and reliable execution harder.
The new architecture relies on an internal orchestration layer called Quarry, to which Airflow submits jobs over HTTP APIs. The job lifecycle is managed server-side, with tracking and controlled cancellation, so execution no longer depends on client connectivity. Spark and Hive workloads moved over using existing interfaces such as Livy and HiveServer2, while arbitrary shell commands run through Apache Hadoop YARN's Distributed Shell. The migration also surfaced issues such as YARN virtual memory enforcement and gaps in cross-account network connectivity.
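To make the server-side lifecycle concrete, here is a minimal sketch of submit, poll, and cancel against Apache Livy's batch endpoints (`POST /batches`, `GET /batches/{id}/state`, `DELETE /batches/{id}`). It is not Quarry's API, which the article does not describe; the host name, payload values, polling interval, and the `cancelled_after_timeout` label are all assumptions made for illustration.

```python
import json
import time
import urllib.request
from typing import Callable, Optional

# Batch states treated as terminal in this sketch.
TERMINAL_STATES = {"success", "dead", "killed", "error"}


def call(method: str, url: str, body: Optional[dict] = None) -> dict:
    """Send one JSON request and decode the JSON response."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(
        url, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read() or b"{}")


def run_batch(
    base_url: str,
    payload: dict,
    http: Callable[..., dict] = call,
    poll_seconds: float = 5.0,
    max_polls: int = 720,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Submit a batch, poll its state, and cancel it if it never finishes."""
    batch_id = http("POST", f"{base_url}/batches", payload)["id"]
    for _ in range(max_polls):
        state = http("GET", f"{base_url}/batches/{batch_id}/state")["state"]
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    # The job lives on the server, so cancellation is an explicit API call
    # rather than a side effect of a dropped client connection.
    http("DELETE", f"{base_url}/batches/{batch_id}")
    return "cancelled_after_timeout"
```

A caller would run something like `run_batch("http://livy.example:8998", {"file": "s3://bucket/job.py"})`, where both values are placeholders. Because the client only holds a batch id, a worker that restarts mid-run can resume polling instead of losing the job.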
Key takeaway
Data engineers who run large-scale EMR pipelines over direct SSH should consider migrating job execution to a REST-based orchestration layer. Doing so reduces the attack surface, removes much of the key-management overhead, and improves job reliability and observability through centralized control. Plan an incremental rollout, use tools such as YARN's Distributed Shell for arbitrary shell workloads, and map network dependencies early to avoid surprises during the transition.
Key insights
Replacing direct SSH access with a REST-driven orchestration layer significantly enhances data pipeline security, reliability, and observability.
Principles
- Decoupling job execution from client connectivity improves system resilience.
- Centralized orchestration enhances security and auditability.
- Incremental migration reduces risk in complex platform changes.
Method
Implement a REST-based orchestration layer for job submission with server-side lifecycle tracking, using existing interfaces such as Livy and HiveServer2 for Spark and Hive workloads and YARN's Distributed Shell for arbitrary shell commands.
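As a rough illustration of the Distributed Shell part of this method, the sketch below builds the argument list for Hadoop's Distributed Shell client, which runs a shell command inside YARN containers. How Slack's orchestration layer actually launches these applications is not stated in the article; the function name, defaults, and jar path are hypothetical, and the jar location in particular varies by distribution.

```python
from typing import List

# Client class shipped with Hadoop's Distributed Shell example application.
DSHELL_CLIENT = "org.apache.hadoop.yarn.applications.distributedshell.Client"


def distributed_shell_argv(
    jar_path: str,
    command: str,
    memory_mb: int = 1024,
    vcores: int = 1,
    containers: int = 1,
    app_name: str = "shell-job",
) -> List[str]:
    """Build the argv for a Distributed Shell launch (hypothetical helper)."""
    if memory_mb <= 0 or vcores <= 0 or containers <= 0:
        raise ValueError("memory_mb, vcores and containers must be positive")
    return [
        "yarn", DSHELL_CLIENT,
        "-jar", jar_path,                 # jar containing the ApplicationMaster
        "-appname", app_name,
        "-shell_command", command,        # command executed in each container
        "-num_containers", str(containers),
        "-container_memory", str(memory_mb),
        "-container_vcores", str(vcores),
    ]
```

An orchestration service could pass this list to `subprocess.run(argv, check=True)` on a host with the Hadoop client installed. Container memory deserves attention here: when `yarn.nodemanager.vmem-check-enabled` is on, a NodeManager can kill a container whose virtual memory exceeds the limit derived from `yarn.nodemanager.vmem-pmem-ratio`, which is the kind of enforcement issue the summary mentions.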
In practice
- Transition EMR job execution from SSH to HTTP API calls.
- Leverage YARN's Distributed Shell for arbitrary command execution in containers.
- Monitor Airflow metadata to track SSH dependencies during phased rollouts.
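The last practice, tracking SSH dependencies through Airflow metadata, could look like the sketch below. It assumes a table shaped like Airflow's `task_instance` (with `dag_id`, `task_id`, `run_id`, and `operator` columns) and uses an in-memory SQLite database with made-up rows so it runs anywhere; the operator name `SSHOperator` and the demo data are illustrative, not taken from the article.

```python
import sqlite3
from typing import List, Tuple

# Count distinct tasks per DAG that still run through a given operator class.
SSH_USAGE_SQL = """
SELECT dag_id, COUNT(DISTINCT task_id) AS ssh_tasks
FROM task_instance
WHERE operator = ?
GROUP BY dag_id
ORDER BY ssh_tasks DESC, dag_id
"""


def remaining_ssh_tasks(
    conn: sqlite3.Connection, operator: str = "SSHOperator"
) -> List[Tuple[str, int]]:
    """Return (dag_id, task_count) for DAGs that still depend on SSH."""
    return [(dag_id, count) for dag_id, count in conn.execute(SSH_USAGE_SQL, (operator,))]


def demo_connection() -> sqlite3.Connection:
    """Build an in-memory stand-in for the metadata database (fake rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE task_instance "
        "(dag_id TEXT, task_id TEXT, run_id TEXT, operator TEXT)"
    )
    rows = [
        ("etl_daily", "extract", "r1", "SSHOperator"),
        ("etl_daily", "extract", "r2", "SSHOperator"),
        ("etl_daily", "load", "r1", "SSHOperator"),
        ("etl_daily", "notify", "r1", "PythonOperator"),
        ("reports", "build", "r1", "SSHOperator"),
        ("ml_train", "fit", "r1", "HttpOperator"),
    ]
    conn.executemany("INSERT INTO task_instance VALUES (?, ?, ?, ?)", rows)
    return conn
```

Running `remaining_ssh_tasks(demo_connection())` on the fake rows lists `etl_daily` with two SSH-backed tasks and `reports` with one. Re-running the same query after each rollout phase gives a simple burn-down of what is left to migrate.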
Topics
- Data Platform Modernization
- Amazon EMR
- REST API Architecture
- Apache Airflow
- Security Hardening
- Distributed Shell
Best for: MLOps Engineer, CTO, VP of Engineering/Data, Data Engineer, DevOps Engineer, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by InfoQ.