This article describes recent work done at NERSC in collaboration with Red Hat to modify Podman (the pod manager tool) to run at a large scale, a key requirement for high-performance computing (HPC). Podman is an open source tool for developing, managing, and running containers on Linux systems. For more details about this work, please see our paper, which will be published in the CANOPIE-HPC Supercomputing 2022 proceedings.
In the following demo video, we walk through pulling an image onto Perlmutter from the NERSC registry, generating a squashed version of the image using podman-hpc, and running the EXAALT benchmark at large scale (900 nodes, 3600 GPUs) via our podman-exec wrapper. NERSC's flagship supercomputing system is Perlmutter, currently number 7 on the TOP500 list. It has a GPU partition with over 6000 NVIDIA A100 GPUs and a CPU partition with over 6000 AMD Milan CPUs. All of the work described in this blog post has been performed on Perlmutter.
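To give a concrete sense of the workflow shown in the demo video, the steps look roughly like the following; the podman-hpc subcommands, the registry path, and the podman-exec invocation below are illustrative assumptions rather than the tools' exact interfaces.

# Pull the image from the NERSC registry (registry path is an example).
podman-hpc pull registry.nersc.gov/myproject/exaalt:latest

# Generate a squashed, read-only copy of the image that can be staged to
# compute nodes efficiently (subcommand name is an assumption).
podman-hpc migrate registry.nersc.gov/myproject/exaalt:latest

# Run the benchmark at scale through Slurm; the podman-exec wrapper starts
# one container per node and launches the MPI ranks inside it.
srun -N 900 --ntasks-per-node=4 --gpus-per-node=4 \
    podman-exec registry.nersc.gov/myproject/exaalt:latest lmp -in in.snap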
NERSC, the National Energy Research Scientific Computing Center, is the US Department of Energy's production mission computing facility that serves the DOE Office of Science, which funds a wide range of fundamental and applied research. In the first half of 2022, more than 700 unique users used Shifter, the current container solution at NERSC, and general user interest in containers is growing.
Although NERSC has demonstrated near bare metal performance with Shifter at large scales, several shortcomings have motivated us to explore Podman. The primary factor is that Shifter does not provide any build utilities. Users must build containers on their own local system and ship their images to NERSC via a registry. Another obstacle is that Shifter provides security by limiting the running container to the privileges of the user who launched it, so users cannot perform actions that require root inside their containers. Finally, Shifter is mostly an "in-house" solution, so users must learn a new technology, and NERSC staff have the additional burden of maintaining this software.
Podman provides a solution to all of these major pain points. Podman is an OCI-compliant framework that adheres to a set of community standards. It will feel familiar to users who have used other OCI-compliant tools like Docker. It also has a large user and developer community with more than 15k stars on GitHub as of October 2022. The major innovation that has drawn us to Podman is rootless containers. Rootless containers elegantly constrain privileges by using a subuid/subgid map to enable the container to run in the user namespace but with what feels like full root privileges. Podman also provides container build functionality that will allow users to build images directly on the Perlmutter login nodes, removing a major roadblock in their development workflows.
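As a rough illustration of the rootless model (the user name, UID, and ID ranges below are generic examples, not Perlmutter's actual configuration): each user is assigned a range of subordinate IDs, and Podman maps them into a user namespace so that UID 0 inside the container is just an unprivileged ID on the host.

# Each user gets a range of subordinate UIDs/GIDs in /etc/subuid and
# /etc/subgid (example values).
$ grep alice /etc/subuid
alice:100000:65536

# Inside the user namespace, container UID 0 maps to the user's own UID,
# and the rest of the range maps to the subordinate IDs.
$ podman unshare cat /proc/self/uid_map
         0       4567          1
         1     100000      65536

# Because no real root privilege is involved, images can be built directly
# on a login node.
$ podman build -t my-app:latest .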
[ Check out the latest Podman articles on Enable Sysadmin. ]
Enabling Podman at a large scale on Perlmutter with near-native performance required us to address site integration, scalability, and performance. Additionally, we have developed two wrapper scripts to achieve two modes of operation: Podman container-per-process and podman-exec. Podman container-per-process mode describes the situation in which many processes are running on the node (usually in an MPI application), with one individual container running for each process. The podman-exec mode describes the situation in which there is a single container running per node, even if there are multiple MPI processes.
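Schematically, and under Slurm, the difference between the two modes looks something like the sketch below; the wrapper invocations are simplified assumptions rather than the actual podman-hpc or podman-exec interfaces.

# Container-per-process: srun starts one task per MPI rank, and each task
# launches its own container, so 64 ranks on a node means 64 containers.
srun -N 2 --ntasks-per-node=64 podman-hpc run --rm my-app:latest ./mpi_app

# podman-exec: the wrapper brings up a single container on each node and
# execs the MPI ranks into it, so each node runs exactly one container
# regardless of the number of ranks.
srun -N 2 --ntasks-per-node=64 podman-exec my-app:latest ./mpi_app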
We ran several benchmarks with podman-hpc on Perlmutter to compare its performance against bare metal and Shifter in both Podman container-per-process and podman-exec modes. The EXAALT benchmark runs the LAMMPS molecular dynamics application, the Pynamic benchmark simulates Python package imports and function invocations, and the DeepCAM benchmark is a climate data segmentation deep learning application. In general, the benchmarks suggest comparable performance between the bare metal, Shifter, and podman-exec cases. The startup overhead incurred in Podman container-per-process can be seen in the results of both Pynamic and DeepCAM. Overall, podman-exec was our best-performing configuration, so this is the mode on which we will focus our future development efforts.
Results from our strong-scaling EXAALT benchmark at 32, 64, 128, and 256 nodes. The average of two bare metal runs is shown in red, Shifter run results are shown in blue, Podman container-per-process run results are shown in dark green, and podman-exec mode results are shown in light green with corresponding error bars.
The results of the Pynamic benchmark for bare metal (red), Shifter (blue), podman-exec mode (green), and Podman container-per-process mode (light green) over two job sizes (128 and 256 nodes) using 64 tasks per node. All configurations were run three times.
The results of the MLPerf™ DeepCAM strong-scaling benchmark for Shifter (blue), Podman container-per-process (light green), and podman-exec mode (dark green) over a range of job sizes (16, 32, 64, and 128 Perlmutter GPU nodes). We separate the timing data into container startup, training startup, and training runtime.
We are excited about the results we have seen so far, but we still have work to do before we can open Podman to all NERSC users. To improve the user experience, we aim to explore adding Slurm integration to remove some of the complexity of working with nested wrapper scripts, especially for the podman-exec case. We also aim to get our podman-hpc scripts and binaries into the Perlmutter boot images of all nodes, so staging these to each node will no longer be necessary. We hope to work with the OCI community to address some of the limitations of the OCI hook functionality (for example, the inability to set environment variables). Finally, our goal is to get much of our work upstreamed into Podman itself so the larger Podman community can leverage it.
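For context on the OCI hook limitation mentioned above: hooks are declared in small JSON files that Podman reads from a hooks directory, and the schema lets a site run a program at container lifecycle stages, but it offers no way for that program to inject environment variables into the container itself. A minimal sketch of such a hook definition, with a hypothetical hook program and annotation, might look like this:

# Hook definitions live in a directory Podman searches, such as
# /etc/containers/oci/hooks.d (the hook program and annotation are
# hypothetical examples).
cat > /etc/containers/oci/hooks.d/90-site-gpu-hook.json <<'EOF'
{
  "version": "1.0.0",
  "hook": {
    "path": "/usr/local/bin/site-gpu-hook"
  },
  "when": {
    "annotations": {
      "hpc.gpu": "true"
    }
  },
  "stages": ["prestart"]
}
EOF

The env field in a hook definition applies only to the hook process itself, not to the container, which illustrates the kind of limitation referred to above.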
These results have not been verified by the MLCommons Association. Note that we are measuring the timing for N epochs, which is not the official MLPerf measurement methodology.