What It Actually Takes to Run Code on a 200M€ Supercomputer

Source: Towards Data Science · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Data Science & Analytics) · Depth: Intermediate, quick

Summary

MareNostrum V, one of the world's fifteen most powerful supercomputers, is located at the Polytechnic University of Catalonia and offers a full-scale High-Performance Computing (HPC) environment. The 200M€ machine is a joint investment from EuroHPC, Spain, Portugal, and Turkey. It has two main partitions: a General Purpose Partition with 6,408 Intel Sapphire Rapids nodes (45.9 PFlops) and an Accelerated Partition with 1,120 NVIDIA H100 SXM GPU nodes (260 PFlops). The nodes are connected by an InfiniBand NDR200 network in a fat-tree topology, which provides non-blocking, low-latency communication across roughly 8,000 nodes. MareNostrum V also integrates quantum infrastructure, including a digital gate-based system and a MareNostrum-Ona quantum annealer, for hybrid classical-quantum computing. Jobs are submitted through SLURM, a workload manager that queues and executes them under strict resource limits, and the compute nodes sit on an air-gapped network.

Key takeaway

For data scientists or ML engineers considering large-scale simulations or model training, understanding HPC architecture and operational rules is crucial. Expect an air-gapped environment, strict job scheduling through SLURM, and resource quotas. Plan your workflows around these constraints: pre-compile your code and arrange data transfers before the job runs. Communication overhead and Amdahl's Law, not raw core count, determine how much parallelism actually pays off.
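Amdahl's Law makes the last point concrete. The worked example below is a sketch under assumed numbers: the parallel fraction p = 0.95 and the core counts N = 100 and N = 1000 are illustrative and do not come from the article. The formula also ignores communication overhead, which pushes real speedups below these values.

```latex
% Amdahl's Law: speedup S on N cores when a fraction p of the work parallelizes
\begin{align}
S(N) &= \frac{1}{(1 - p) + \dfrac{p}{N}}, &
\lim_{N \to \infty} S(N) &= \frac{1}{1 - p}
\end{align}

% Illustrative values (assumed): p = 0.95
\begin{align}
S(100)  &= \frac{1}{0.05 + 0.0095}  \approx 16.8 \\
S(1000) &= \frac{1}{0.05 + 0.00095} \approx 19.6 \\
\lim_{N \to \infty} S(N) &= \frac{1}{0.05} = 20
\end{align}
```

Under these assumptions, ten times more cores raises the speedup only from about 16.8 to about 19.6, and no core count can exceed 20.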

Key insights

Supercomputers like MareNostrum V are distributed systems: thousands of networked nodes behind a shared scheduler. Using them well requires understanding both the architecture and the operating rules.

Method

Connect to the login nodes over SSH, write a SLURM job script with resource directives such as `--nodes`, `--ntasks`, and `--time`, submit it to the queue, and move data in and out with `scp` or `rsync`.
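The workflow above can be sketched as a short shell session. This is a minimal illustration, not the article's own script. The job name, the resource values, the file name `train_job.sh`, the `./my_app` binary, and the host `login.example.org` are all hypothetical. Real clusters typically also require site-specific options (for example an account or queue) that are omitted here. The submission and transfer commands are left as comments, so the snippet only writes the job script to disk.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Write a hypothetical SLURM job script; every value below is illustrative.
cat > train_job.sh <<'EOF'
#!/bin/bash
# Hypothetical job name shown in the queue
#SBATCH --job-name=demo
# Number of compute nodes requested
#SBATCH --nodes=2
# Total number of tasks (processes) across those nodes
#SBATCH --ntasks=8
# Wall-clock limit (HH:MM:SS); the job is stopped when it is exceeded
#SBATCH --time=00:30:00
# Output file; %j expands to the SLURM job ID
#SBATCH --output=demo_%j.out

# srun launches the tasks on the allocated nodes.
# ./my_app is a placeholder for a binary compiled before submission.
srun ./my_app
EOF

echo "wrote train_job.sh"

# From your own machine (hypothetical host and paths), stage data first:
#   rsync -av ./data/ user@login.example.org:project/data/
# Then, on a login node, submit the job and check the queue:
#   sbatch train_job.sh
#   squeue -u "$USER"
```

Because the compute nodes are air-gapped, anything the job needs (binaries, dependencies, input data) has to be on the cluster's filesystem before submission.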

Best for: Data Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.