I showed how you could do some awesome stuff, including running containers with lots of different user IDs (UIDs), installing software, setting up networking, and running containers from Quay.io, Docker.io, or pretty much any other container registry.
That said, rootless containers are not a panacea. There are a lot of shortcomings, and people need to understand what can go wrong.
Volume mounting other content
I recently replied to an issue about Podman on GitHub. The user was attempting to run Plex in a container and wanted to volume mount /run into the container. He knew to disable the privilege separation since SELinux would block the use of /run within the container. When he ran the container with Podman as root, it worked perfectly. But when he ran the container as non-root, it blew up with an error:
to /tmp/runctop091524734/runctmpdir731776453: open
3d1fa08658a44c40f43bd950a17a/merged/run/lock/lvm: permission denied\"""
: internal libpod error
This error indicates that a process inside the container attempted to open the lvm file in /run/lock within the container and failed with permission denied. The user was understandably confused, since the container was running in privileged mode.
"Doesn't privileged mean the container has full root?"
Why did it fail?
This failed because the container is running in a user namespace. The process running the container still runs with the user's real UID, even though the container reports it as root. Running rootless containers gives the user no additional rights on the host, other than the use of a few additional UIDs and GIDs defined in the /etc/subuid and /etc/subgid files.
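To make the mapping concrete, here is a minimal sketch of how a UID seen inside a rootless container corresponds to a real UID on the host. It assumes the common default layout, where container UID 0 is the invoking user and the remaining container UIDs are taken in order from a single /etc/subuid range. The function name and the example numbers (UID 1000, range 100000:65536) are hypothetical and only for illustration.

```python
def container_to_host_uid(container_uid, user_uid, subuid_start, subuid_count):
    """Map a UID seen inside a rootless container to the real host UID.

    Assumes the default layout: container UID 0 is the invoking user, and
    container UIDs 1..subuid_count come from one /etc/subuid range.
    """
    if container_uid == 0:
        # "root" in the container is just the unprivileged user on the host.
        return user_uid
    if 1 <= container_uid <= subuid_count:
        return subuid_start + container_uid - 1
    raise ValueError(f"container UID {container_uid} is not mapped")
```

With these example values, container_to_host_uid(0, 1000, 100000, 65536) returns 1000. That is why a process that looks like root inside the container is still denied access to files owned by the host's real root.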
If the user had not mounted /run into the container, this failure would not have happened, because /run would have been created with the user's UID and all of its contents would be owned by the user.
If you volume-mount content from the host into a rootless container, you need to make sure the content is readable by the user without being root. If the container needs to write to the volume mount, the content must also be owned either by the user's UID or by a UID listed in /etc/subuid or /etc/subgid as allocated for use by the user.
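One way to check this before mounting is sketched below. It only tests ownership against the user's own UID and the ranges found in /etc/subuid-style text; it does not look at permission bits, SELinux labels, or entries keyed by numeric UID, so treat it as a rough pre-flight check rather than what Podman itself does. All function names here are hypothetical.

```python
import os


def parse_subid(text, username):
    """Return (start, count) ranges for username from /etc/subuid-style text."""
    ranges = []
    for line in text.splitlines():
        parts = line.strip().split(":")
        if len(parts) != 3 or parts[0] != username:
            continue
        try:
            ranges.append((int(parts[1]), int(parts[2])))
        except ValueError:
            continue  # skip malformed entries
    return ranges


def uid_usable(owner_uid, user_uid, ranges):
    """True if owner_uid is the user's UID or falls in an allocated range."""
    if owner_uid == user_uid:
        return True
    return any(start <= owner_uid < start + count for start, count in ranges)


def path_owner_usable(path, user_uid, ranges):
    """Check the owner of a host path that is about to be volume-mounted."""
    return uid_usable(os.stat(path).st_uid, user_uid, ranges)
```

For a user with UID 1000 and the range 100000:65536, a file owned by host root (UID 0), such as the lvm lock file in the example above, is not usable, which matches the permission denied error.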
At Red Hat Summit 2019, we ran a great lab on container security that illustrated all the ways you can interact with containers from a security point of view. One of the exercises involved running a container with the Network Time Protocol daemon (ntpd) inside. The ntpd program attempts to modify the system time on the host running the container. The command fails when it runs as root in an unprivileged container unless you start the container using a command that looks like this:
sudo podman run -d --cap-add SYS_TIME ntpd
Podman will execute this container with the CAP_SYS_TIME capability allowed in the container, which allows processes running in it to modify the system time.
CAP_SYS_TIME Set system clock (settimeofday(2), stime(2), adjtimex(2)); set real-time (hardware) clock.
When users attempt to run this exact command in rootless mode, it fails. Why?
If the user examines the capabilities inside the container, it shows the container has CAP_SYS_TIME, so why does it still get permission denied?
Again, running rootless containers does not give your container any special privileges that your processes would not have outside. When running in rootless containers, you get user namespaced capabilities. These namespaced capabilities allow the root process to perform some privileged operations while inside the container. But changing the system time is not permitted; this requires the real CAP_SYS_TIME system capability.
Since there is no namespaced time, this capability is somewhat useless to the container, so people often ask: why have capabilities at all? The answer is that many of the capabilities are still useful. For example, CAP_SETUID and CAP_SETGID allow the processes inside the container to change their UID and group identifier (GID) to any UID or GID defined within the container. Modifying UIDs and GIDs of processes outside the container is still denied. There are many other examples of things that are allowed only if the process has user namespaced capabilities.
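To see why the container appears to hold CAP_SYS_TIME, it helps to decode the capability mask that the kernel reports in the CapEff field of /proc/self/status. The sketch below does that for the capabilities mentioned in this article, using the bit numbers from the Linux capability list. The helper names are hypothetical. Keep in mind that a set bit only says the process holds the capability in its own user namespace; it says nothing about resources, like the system clock, that belong to the host.

```python
# Bit positions of the capabilities discussed in this article.
CAP_BITS = {
    "CAP_SETGID": 6,
    "CAP_SETUID": 7,
    "CAP_NET_BIND_SERVICE": 10,
    "CAP_SYS_TIME": 25,
}


def has_capability(cap_eff_hex, name):
    """Return True if the named bit is set in a hex CapEff mask."""
    return bool(int(cap_eff_hex, 16) & (1 << CAP_BITS[name]))


def read_cap_eff(status_path="/proc/self/status"):
    """Read the effective capability mask of the current process (Linux only)."""
    with open(status_path) as status:
        for line in status:
            if line.startswith("CapEff:"):
                return line.split()[1]
    raise RuntimeError("CapEff not found in " + status_path)
```

Running has_capability(read_cap_eff(), "CAP_SYS_TIME") inside the rootless container from the example would report the capability as present, even though setting the host clock is still refused.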
Binding to ports less than 1024
The last example of a shortcoming in rootless Podman is the inability to listen for incoming connections on the host on a port less than 1024. This is really just another example of user namespaced capabilities.
For example, if you want to run a container and have it listen on port 80 on the host, you need to run it as root, or at least with the CAP_NET_BIND_SERVICE capability.
CAP_NET_BIND_SERVICE Bind a socket to Internet domain privileged ports (port numbers less than 1024).
sudo podman run -d --net=host httpd
works fine and binds to port 80 on the host. When Podman runs containers as root, the CAP_NET_BIND_SERVICE capability is allowed by default. But if you run Podman as an unprivileged user, this will be blocked. For example,
podman run -d --net=host httpd
will fail with permission denied. The user process does not have the CAP_NET_BIND_SERVICE capability over the host's network namespace, so it is not allowed to bind to ports below 1024 on the host.
podman run -d httpd
should work because Podman creates a new network namespace, and the root process within the user namespace has CAP_NET_BIND_SERVICE for the network namespace created within the container. This port, however, is not port 80 on the host, but port 80 on the container's network address.
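The same restriction can be reproduced outside of Podman with a few lines of Python. This sketch assumes the default limit of 1024 that the article describes; the constant and function names are hypothetical. The try_bind function reports only the permission case and lets other errors, such as the port already being in use, propagate.

```python
import socket

# Ports below this number need CAP_NET_BIND_SERVICE (default limit, per the article).
PRIVILEGED_PORT_LIMIT = 1024


def is_privileged_port(port, limit=PRIVILEGED_PORT_LIMIT):
    """True if binding to this port normally requires CAP_NET_BIND_SERVICE."""
    return 0 < port < limit


def try_bind(port, host="0.0.0.0"):
    """Attempt to bind a TCP socket; return False on permission denied."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except PermissionError:
            return False
        return True
```

As an unprivileged user on the host, try_bind(80) would be expected to return False for the same reason the --net=host container fails, while a port such as 8080 is not restricted.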
We keep track of these problems on the Shortcomings of Rootless Podman GitHub page.
Rootless Podman and Buildah can do most things people want to do with containers, but there are times when root is still required. Sometimes it is difficult to know why you are getting permission denied, but hopefully this article illustrates some of the main causes. Understanding them will help you troubleshoot problems and change your design accordingly.