HPC containers at scale using Podman

Learn how Podman is being modified to run at a large scale for high-performance computing (HPC).

This article describes recent work done at NERSC in collaboration with Red Hat to modify Podman (the pod manager tool) to run at a large scale, a key requirement for high-performance computing (HPC). Podman is an open source tool for developing, managing, and running containers on Linux systems. For more details about this work, please see our paper, which will be published in the CANOPIE-HPC Supercomputing 2022 proceedings.

In the following demo video, we walk through pulling an image onto Perlmutter from the NERSC registry, generating a squashed version of the image using podman-hpc, and running the EXAALT benchmark at large scale (900 nodes, 3600 GPUs) via our podman-exec wrapper. NERSC's flagship supercomputing system is Perlmutter, currently number 7 on the Top 500 list. It has a GPU partition with over 6000 NVIDIA A100 GPUs and a CPU partition with over 6000 AMD Milan CPUs. All of the work described in this blog post has been performed on Perlmutter.

NERSC, the National Energy Research Scientific Computing Center, is the US Department of Energy's production mission computing facility. It serves the DOE Office of Science, which funds a wide range of fundamental and applied research. In the first half of 2022, more than 700 unique users used Shifter, the current container solution at NERSC, and general user interest in containers is growing.

Although NERSC has demonstrated near bare metal performance with Shifter at large scales, several shortcomings have motivated us to explore Podman. The primary factor is that Shifter does not provide any build utilities. Users must build containers on their own local system and ship their images to NERSC via a registry. Another obstacle is that Shifter provides security by limiting the running container to the privileges of the user who launched it. Finally, Shifter is mostly an "in-house" solution, so users must learn a new technology, and NERSC staff have the additional burden of maintaining this software.

Podman provides a solution to all of these major pain points. Podman is an OCI-compliant framework that adheres to a set of community standards. It will feel familiar to users who have used other OCI-compliant tools like Docker. It also has a large user and developer community, with more than 15k stars on GitHub as of October 2022. The major innovation that has drawn us to Podman is rootless containers. Rootless containers elegantly constrain privileges by using a subuid/subgid map, which lets the container run in a user namespace with what feels like full root privileges. Podman also provides container build functionality that will allow users to build images directly on the Perlmutter login nodes, removing a major roadblock in their development workflows.
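To make the subuid/subgid idea concrete, here is a minimal Python sketch of the default rootless mapping as we understand it: ID 0 inside the container maps to the invoking user's own ID on the host, and IDs 1 and up map into the subordinate range listed for that user in /etc/subuid. The user name `alice`, the UID 1000, and the range `100000:65536` are illustrative assumptions, not values from Perlmutter, and the helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubidRange:
    start: int  # first host ID in the subordinate range
    count: int  # number of IDs in the range


def parse_subid_line(line):
    """Parse one /etc/subuid-style entry of the form 'name:start:count'."""
    name, start, count = line.strip().split(":")
    return name, SubidRange(int(start), int(count))


def container_to_host_id(container_id, user_id, sub):
    """Translate an ID seen inside the container to the ID the host sees.

    Assumes the default rootless layout: container ID 0 is the invoking
    user, and container IDs 1..count come from the subordinate range.
    """
    if container_id == 0:
        return user_id
    if 1 <= container_id <= sub.count:
        return sub.start + container_id - 1
    raise ValueError(f"container ID {container_id} is not mapped")


# Illustrative values only (hypothetical user and range).
name, sub = parse_subid_line("alice:100000:65536")
print(name, container_to_host_id(0, 1000, sub))  # "root" in the container is alice on the host
print(name, container_to_host_id(1, 1000, sub))  # first subordinate ID
```

The point of the sketch is that "root" inside the container never corresponds to host root: every container ID resolves to either the user's own ID or an unprivileged subordinate ID.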

Enabling Podman at a large scale on Perlmutter with near-native performance required us to address site integration, scalability, and performance. Additionally, we have developed two wrapper scripts to achieve two modes of operation: Podman container-per-process and podman-exec. Podman container-per-process mode describes the situation in which many processes are running on the node (usually in an MPI application), with one individual container running for each process. The podman-exec mode describes the situation in which there is a single container running per node, even if there are multiple MPI processes.
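The difference between the two modes can be sketched with plain `podman run` and `podman exec` command lines. This is not our wrapper code: the image name, the container name `node-container`, the `sleep infinity` keep-alive command, and both helper functions are assumptions made for illustration, and a real job would launch these through the batch system rather than printing them.

```python
import shlex


def container_per_process_cmds(image, app, ranks_per_node):
    """Container-per-process: one `podman run` (one container) per rank."""
    return [["podman", "run", "--rm", image, *app] for _ in range(ranks_per_node)]


def exec_mode_cmds(image, app, ranks_per_node, name="node-container"):
    """Exec mode: start one container per node, then `podman exec` each rank into it."""
    # Assumes the image provides `sleep` to keep the container alive.
    start = ["podman", "run", "--detach", "--name", name, image, "sleep", "infinity"]
    execs = [["podman", "exec", name, *app] for _ in range(ranks_per_node)]
    return [start, *execs]


# Hypothetical image and application, 4 ranks on one node.
image = "registry.example/app:latest"
app = ["./app", "--input", "in.dat"]

for cmd in container_per_process_cmds(image, app, 4):
    print(shlex.join(cmd))
for cmd in exec_mode_cmds(image, app, 4):
    print(shlex.join(cmd))
```

With N ranks per node, the first mode pays container startup N times, while the second pays it once and then only the cost of N `podman exec` calls.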

We ran several benchmarks on Perlmutter to compare the performance of four configurations: bare metal, Shifter, Podman container-per-process, and podman-exec mode. The EXAALT benchmark runs the LAMMPS molecular dynamics application, the Pynamic benchmark simulates Python package imports and function invocations, and the DeepCAM benchmark is a climate data segmentation deep learning application. In general, the benchmarks suggest comparable performance between the bare metal, Shifter, and podman-exec cases. The startup overhead incurred in Podman container-per-process mode can be seen in the results of both Pynamic and DeepCAM. Overall, podman-exec was our best performing Podman configuration, so this is the mode on which we will focus our future development efforts.

Figure: Performance results for EXAALT carbon analysis. (Image by Laurie Stephey, CC BY-SA 4.0)

Results from our strong-scaling EXAALT benchmark at 32, 64, 128, and 256 nodes. The average of two bare metal runs is shown in red, Shifter results are shown in blue, Podman container-per-process results are shown in dark green, and podman-exec mode results are shown in light green, each with corresponding error bars.
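For readers less familiar with strong scaling, the sketch below shows how speedup and parallel efficiency are usually derived from runtimes at a fixed problem size. The timings are placeholder values chosen for easy arithmetic; they are not measurements from our runs, and the function name is hypothetical.

```python
def strong_scaling(times_by_nodes):
    """Return {nodes: (speedup, efficiency)} relative to the smallest node count.

    speedup    = T(base) / T(n)
    efficiency = speedup / (n / base)
    """
    base_n = min(times_by_nodes)
    base_t = times_by_nodes[base_n]
    results = {}
    for n in sorted(times_by_nodes):
        speedup = base_t / times_by_nodes[n]
        efficiency = speedup / (n / base_n)
        results[n] = (speedup, efficiency)
    return results


# Placeholder runtimes in seconds (NOT benchmark data).
placeholder_times = {32: 80.0, 64: 40.0, 128: 25.0, 256: 20.0}

for nodes, (speedup, eff) in strong_scaling(placeholder_times).items():
    print(f"{nodes:4d} nodes: speedup {speedup:.2f}, efficiency {eff:.2f}")
```

An efficiency close to 1 means the runtime drops nearly in proportion to the added nodes; fixed per-container startup costs tend to show up as falling efficiency at larger node counts.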

(Image by Laurie Stephey, CC BY-SA 4.0)

The results of the Pynamic benchmark for bare metal (red), Shifter (blue), podman-exec mode (green), and Podman container-per-process mode (light green) at two job sizes (128 and 256 nodes) using 64 tasks per node. All configurations were run three times.

(Image by Laurie Stephey, CC BY-SA 4.0)

The results of the MLPerf™ DeepCAM strong-scaling benchmark for Shifter (blue), Podman container-per-process (light green), and podman-exec mode (dark green) over a range of job sizes (16, 32, 64, and 128 Perlmutter GPU nodes). We separate the timing data into container startup, training startup, and training runtime.

We are excited about the results we have seen so far, but we still have work to do before we can open Podman to all NERSC users. To improve the user experience, we aim to explore adding Slurm integration to remove some of the complexity of working with nested wrapper scripts, especially for the podman-exec case. We also aim to get our podman-hpc scripts and binaries into the Perlmutter boot images of all nodes, so staging these to each node will no longer be necessary. We hope to address some of the limitations of the OCI hook functionality (for example, the inability to set environment variables) with the OCI community. Finally, our goal is to get much of our work upstreamed into Podman itself so the larger Podman community can leverage our work.

These results have not been verified by the MLCommons Association. Note that we are measuring the timing for N epochs, which is not the official MLPerf measurement methodology.

This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.
