Slack Eliminates SSH in EMR Pipelines, Migrates 700+ Jobs to REST-Based Architecture

Source: InfoQ · Field: Technology & Digital (Software Development & Engineering, Cloud Computing & IT Infrastructure) · Depth: Intermediate, quick

Summary

Slack modernized its data platform by replacing SSH-based job execution with a REST-driven orchestration layer across its Amazon EMR pipelines. The migration, completed over three quarters without downtime, moved more than 700 Airflow operators to a centralized job submission system, with the goal of improving security, reliability, and observability across eight data regions. Previously, Airflow operators opened direct SSH connections to EMR master nodes, which widened the attack surface, added operational overhead for key management, and made consistent auditing and reliable execution harder. In the new architecture, Airflow submits jobs over HTTP APIs to an internal orchestration layer called Quarry. Quarry manages the job lifecycle server-side, with tracking and controlled cancellation, so execution no longer depends on the client staying connected. Spark and Hive workloads moved onto existing service interfaces, Livy and HiveServer2, while arbitrary shell commands run through Apache Hadoop YARN's Distributed Shell. The migration also surfaced issues such as YARN virtual memory enforcement and gaps in cross-account network connectivity.
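The submit, track, and cancel lifecycle described above can be sketched against Apache Livy's batch endpoints (POST /batches, GET /batches/{id}/state, DELETE /batches/{id}). This is a minimal illustration, not Slack's Quarry API, which the summary does not describe. The class name LivyBatchClient, the injectable transport, the polling defaults, and the set of terminal state names are assumptions made for the example.

```python
import json
import time
import urllib.request
from typing import Any, Callable, List, Optional

# Transport signature: (method, url, payload) -> decoded JSON body.
Transport = Callable[[str, str, Optional[dict]], Any]

# Assumption: states treated as terminal for a Livy batch.
TERMINAL_STATES = {"success", "dead", "killed", "error"}


def http_json(method: str, url: str, payload: Optional[dict] = None) -> Any:
    """Default transport: send one JSON request using only the standard library."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    request = urllib.request.Request(
        url,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        body = response.read()
    return json.loads(body) if body else None


class LivyBatchClient:
    """Hypothetical thin client: the job runs server-side, the client only polls."""

    def __init__(self, base_url: str, transport: Transport = http_json) -> None:
        self.base_url = base_url.rstrip("/")
        self.transport = transport

    def submit(self, file: str, args: Optional[List[str]] = None) -> int:
        """Create a batch and return the id assigned by the server."""
        payload = {"file": file, "args": list(args or [])}
        return self.transport("POST", f"{self.base_url}/batches", payload)["id"]

    def state(self, batch_id: int) -> str:
        """Fetch the current state of a batch."""
        url = f"{self.base_url}/batches/{batch_id}/state"
        return self.transport("GET", url, None)["state"]

    def cancel(self, batch_id: int) -> None:
        """Ask the server to kill the batch."""
        self.transport("DELETE", f"{self.base_url}/batches/{batch_id}", None)

    def wait(self, batch_id: int, poll_seconds: float = 10.0, max_polls: int = 360) -> str:
        """Poll until a terminal state; cancel and raise if the poll budget runs out."""
        for _ in range(max_polls):
            current = self.state(batch_id)
            if current in TERMINAL_STATES:
                return current
            time.sleep(poll_seconds)
        self.cancel(batch_id)
        raise TimeoutError(f"batch {batch_id} cancelled after {max_polls} polls")
```

Because the batch id is the only thing the client holds, a worker that loses its connection can resume polling the same id later, which is the decoupling from client connectivity that the summary attributes to the new design.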

Key takeaway

If you are a data engineer running large-scale EMR pipelines and still rely on direct SSH access for job execution, consider migrating to a REST-based orchestration layer. Doing so shrinks the attack surface, removes the overhead of SSH key management, and improves job reliability and observability through centralized control. Plan an incremental rollout, use tools such as YARN's Distributed Shell for arbitrary shell workloads, and map network dependencies (for example, cross-account connectivity) before cutover so the transition goes smoothly.
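One way to discover network dependencies ahead of a cutover is a simple TCP preflight run from the host that will submit jobs. The sketch below is an assumption about how such a check could look, not something the original article describes. The function names are hypothetical, and the endpoint list would come from your own environment.

```python
import socket
from typing import Iterable, List, Tuple

Endpoint = Tuple[str, int]


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def unreachable_endpoints(endpoints: Iterable[Endpoint], timeout: float = 3.0) -> List[Endpoint]:
    """Return the endpoints that could not be reached, in input order."""
    return [(host, port) for host, port in endpoints if not is_reachable(host, port, timeout)]
```

A check like this only proves that a TCP handshake succeeds. It does not validate authentication or API behavior, so it complements rather than replaces an end-to-end test job.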

Key insights

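Replacing direct SSH access with a REST-driven orchestration layer improves data pipeline security (no SSH keys to manage and no direct SSH path to EMR master nodes), reliability (the job lifecycle is tracked server-side, independent of client connectivity), and observability (centralized submission gives one consistent place to audit jobs).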

Method

Put a REST-based orchestration layer in front of job submission: route Spark and Hive workloads through existing interfaces such as Livy and HiveServer2, run arbitrary shell commands through YARN's Distributed Shell, and track each job's lifecycle server-side.
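To illustrate the Distributed Shell path, the sketch below builds the command line for YARN's Distributed Shell client, following the flag style used in the Hadoop documentation (-jar, -shell_command, -shell_args, -num_containers, -container_memory). It also estimates the virtual memory ceiling that the NodeManager enforces when yarn.nodemanager.vmem-check-enabled is on: container memory multiplied by yarn.nodemanager.vmem-pmem-ratio, which defaults to 2.1. The function names and the jar path are hypothetical, and how Slack's orchestration layer actually invokes Distributed Shell is not stated in the source.

```python
import shlex
from typing import List, Optional

DSHELL_CLIENT = "org.apache.hadoop.yarn.applications.distributedshell.Client"


def distributed_shell_argv(
    jar_path: str,
    shell_command: str,
    shell_args: Optional[List[str]] = None,
    num_containers: int = 1,
    container_memory_mb: int = 1024,
) -> List[str]:
    """Build the argv for a Distributed Shell submission (jar_path is environment-specific)."""
    if num_containers < 1:
        raise ValueError("num_containers must be at least 1")
    argv = ["yarn", DSHELL_CLIENT, "-jar", jar_path, "-shell_command", shell_command]
    if shell_args:
        # Arguments for the shell command are passed separately from the command itself.
        argv += ["-shell_args", *shell_args]
    argv += [
        "-num_containers", str(num_containers),
        "-container_memory", str(container_memory_mb),
    ]
    return argv


def vmem_limit_mb(container_memory_mb: int, vmem_pmem_ratio: float = 2.1) -> float:
    """Virtual memory ceiling for a container when the NodeManager vmem check is enabled."""
    return container_memory_mb * vmem_pmem_ratio


if __name__ == "__main__":
    # Hypothetical jar location; the real path depends on the Hadoop installation.
    argv = distributed_shell_argv("/path/to/distributedshell.jar", "sleep", ["10"])
    print(shlex.join(argv))
    print(vmem_limit_mb(1024))
```

Shell commands that spawn child processes can reserve far more virtual memory than they actually use, which is one reason a container may hit this ceiling even when its physical memory use is modest.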

Best for: MLOps Engineer, CTO, VP of Engineering/Data, Data Engineer, DevOps Engineer, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by InfoQ.