Recently I answered a question over email about SELinux and container runtimes. Afterward, I realized that other people might be wondering about the same topic, so I decided to turn my answer into an article for Opensource.com, hoping I might be able to help other people who have the same question. The email began:
"Dan, you were kind enough to answer an SELinux question of mine some years back, and I'm hoping you're still in the business."
Although I now lead the containers team at Red Hat, and no longer work on the SELinux team, in a lot of ways I am still heavily involved in SELinux, since I have been working on getting SELinux and containers to work well together. SELinux and container technology are the perfect combination.
The email continued:
"I've just started to deal with some software that is containerized via Docker, and which is ordinarily only ever run on Ubuntu. Naturally this means nobody ever put any thought into how it will interact with SELinux.
"I know that containers get a pair of randomly chosen MCS [Multi-Category Security] labels by default, and that the files they create obviously end up with those same categories. However, when it's time to rebuild or upgrade the container, the files are now inaccessible because the new container has a different pair of categories.
"Are we supposed to relabel these files with the new categories? Or do we have to pick the categories ourselves and then use Docker's
--security-opt option when we run the container? How do we do so without risk that some other container will end up with the same categories?"
Regarding the first question, when a container runtime like Docker, or one of the newer ones we have been working on (podman, CRI-O, and Buildah), creates a container, it picks a random MCS label for the container to run with. An MCS label consists of two random numbers between 0 and 1,023, and each label must be unique on the host. The numbers are prefixed with a c, for category. SELinux also needs a sensitivity level, s0, so an MCS label looks like s0:c1,c2. Note that s0:c2,c1 is the same thing. Also, the two numbers may not be the same; SELinux would translate s0:c1,c1 to s0:c1. This gives us approximately (1024 * 1024) / 2 - 1024 category pairs, or about 500,000 unique containers on a host.
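The label rules above can be sketched in a few lines of Python. This is only an illustration, not how SELinux or any container runtime implements it: the function name normalize_mcs and the constant names are hypothetical, and the parser handles only simple comma-separated categories, not category ranges.

```python
from math import comb

NUM_CATEGORIES = 1024  # categories c0 through c1023


def normalize_mcs(label: str) -> str:
    """Return a canonical form of a label such as 's0:c2,c1'.

    Hypothetical helper: sorts the categories and drops duplicates,
    mirroring the equivalences described in the text.
    """
    sensitivity, _, cats = label.partition(":")
    # Parse 'c2,c1' into integers; a set removes repeated categories.
    numbers = sorted(
        {int(c.strip().lstrip("c")) for c in cats.split(",") if c.strip()}
    )
    for n in numbers:
        if not 0 <= n < NUM_CATEGORIES:
            raise ValueError(f"category c{n} is out of range")
    if not numbers:
        return sensitivity
    return sensitivity + ":" + ",".join(f"c{n}" for n in numbers)


# Exact number of unordered pairs of two distinct categories.
PAIR_COUNT = comb(NUM_CATEGORIES, 2)

print(normalize_mcs("s0:c2,c1"))  # s0:c1,c2
print(normalize_mcs("s0:c1,c1"))  # s0:c1
print(PAIR_COUNT)                 # 523776
```

The exact pair count, 523,776, is close to the rough figure given above; either way the result is about half a million labels.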
We originally created MCS labeling back in 2008 for virtual machines, and it was often referred to as sVirt. We figured that running a half-million VMs on a single machine would not happen for a few years. With containers, that limit could actually come within reach. If it does, we could always go to three or more categories for each label, although the algorithm becomes more complicated.
SELinux does more than just MCS labeling. The process and content also get assigned SELinux "Types." Processes usually run with the container_t type, and content is created with a corresponding file type.
Note: I wrote SELinux/MCS golang bindings to natively implement the SELinux interfaces to set up labeling. These bindings were contributed to the Open Container Initiative (OCI).
The second part of the question asks about the content created by the container on disk. The writer is correct that the content on volumes will carry the content label, including the container's MCS categories.
But he makes a false assumption that when the container restarts, it will choose a different MCS label and therefore will no longer be able to read the content. Container runtimes do not destroy the "container" when the processes in the container stop. They record the information on how to run the container, including the SELinux labels used to run it. So, when you stop and start a container, it will always run with the same MCS label. Not only that, but the container runtimes also read their database of existing containers when they start, reserving the MCS labels that are already in use, so they can guarantee that new containers will not conflict with already reserved MCS labels.
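The reservation behavior described above can be sketched as follows. This is a simplified in-memory model, not the code any runtime actually uses: the class McsAllocator and its methods are hypothetical, and the reserved labels passed in are assumed to already be in sorted form such as s0:c1,c2.

```python
import random


class McsAllocator:
    """Hypothetical stand-in for a runtime's MCS label bookkeeping."""

    def __init__(self, reserved=(), rng=None):
        # Labels already recorded for existing containers are reserved
        # up front, as a runtime would do after reading its database.
        self._reserved = set(reserved)
        self._rng = rng or random.Random()

    def allocate(self) -> str:
        """Pick a random unused pair of distinct categories."""
        while True:
            c1, c2 = sorted(self._rng.sample(range(1024), 2))
            label = f"s0:c{c1},c{c2}"
            if label not in self._reserved:
                self._reserved.add(label)
                return label

    def release(self, label: str) -> None:
        """Free a label once its container has been removed."""
        self._reserved.discard(label)


# Labels of existing containers stay reserved across "restarts".
allocator = McsAllocator(reserved={"s0:c1,c2"})
new_label = allocator.allocate()
print(new_label != "s0:c1,c2")  # True
```

Note that release makes a label available again, which is why content left on disk after a container is removed can later match a new container's label.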
You can override this behavior. If you wanted to create a second container that was able to access the data created by the first, you could tell the container runtime to use the same MCS label.
# podman run -ti -v /var/lib/previouscontainer:/var/lib/db --security-opt label=level:s0:c1,c2 fedora sh
# docker run -ti -v /var/lib/previouscontainer:/var/lib/db --security-opt label=level:s0:c1,c2 fedora sh
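If you script this, the level to pass can be taken from the first container's full SELinux context, which has the form user:role:type:level. The sketch below only builds the argument list shown in the commands above; the helper names level_from_context and run_args are hypothetical, and the example context string is an assumption about what a container process label typically looks like.

```python
def level_from_context(context: str) -> str:
    """Extract the level (e.g. 's0:c1,c2') from 'user:role:type:level'."""
    parts = context.split(":", 3)
    if len(parts) != 4:
        raise ValueError(f"not a full SELinux context: {context!r}")
    return parts[3]


def run_args(runtime, host_dir, container_dir, level, image, command):
    """Build the same command line as above for podman or docker."""
    return [
        runtime, "run", "-ti",
        "-v", f"{host_dir}:{container_dir}",
        "--security-opt", f"label=level:{level}",
        image, command,
    ]


# Assumed example context of the first container's process.
level = level_from_context("system_u:system_r:container_t:s0:c1,c2")
args = run_args("podman", "/var/lib/previouscontainer", "/var/lib/db",
                level, "fedora", "sh")
print(" ".join(args))
```

The resulting list can be handed to subprocess.run if you want to launch the container from Python.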
Now, if you remove a container from the container runtime and leave the content on disk, there is a chance the MCS label will be reused by a new container. The best thing to do with this content is to change the type of the content when the container is removed. The command restorecon -rF /var/lib/previouscontainer will force the label of the content back to a label that containers can't read/write.
After reading my response, the email author replied with another question:
"Oh, and what if I were to have containers created by both Docker and libvirt?"
I could interpret this question in two different ways. One would be that he is worried about containers created via Docker and VMs created by libvirt. The simple answer to this question is: Even though both use the same MCS ranges for labeling, they use different types. Libvirt uses svirt_t (process) and svirt_image_t (files), and SELinux would maintain separation based on type enforcement.
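To see why matching MCS categories alone are not enough, here is a deliberately simplified model of the two checks. It is not how SELinux policy is evaluated: the ALLOWED table and the function names are hypothetical, and container_file_t is my assumption for the file type used for container content.

```python
# Simplified type-enforcement table: which file types a process type
# may read and write. Real policy is far richer than this.
ALLOWED = {
    "container_t": {"container_file_t"},  # assumed container file type
    "svirt_t": {"svirt_image_t"},
}


def parse_context(context: str):
    """Split 'user:role:type:s0:c1,c2' into (type, set of categories)."""
    _user, _role, type_, level = context.split(":", 3)
    _sensitivity, _, cats = level.partition(":")
    categories = frozenset(cats.split(",")) if cats else frozenset()
    return type_, categories


def may_access(process_context: str, file_context: str) -> bool:
    """Both checks must pass: type enforcement and MCS separation."""
    ptype, pcats = parse_context(process_context)
    ftype, fcats = parse_context(file_context)
    type_ok = ftype in ALLOWED.get(ptype, set())
    mcs_ok = fcats <= pcats  # process categories must cover the file's
    return type_ok and mcs_ok


container_file = "system_u:object_r:container_file_t:s0:c1,c2"
print(may_access("system_u:system_r:container_t:s0:c1,c2", container_file))  # True
print(may_access("system_u:system_r:container_t:s0:c3,c4", container_file))  # False
print(may_access("system_u:system_r:svirt_t:s0:c1,c2", container_file))      # False
```

The last call shows the case from the question: a VM process with the same categories as a container is still denied, because the type check fails.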
But another way to look at the question would be to look at libvirt-lxc, which also creates containers. Other toolchains using container-selinux include our new toolchains CRI-O, podman, and Buildah, as well as rkt and systemd-nspawn; even the lxc tools take advantage of it.
Docker does not share its MCS datastore with any of these other tools, so it is better either not to run them on the same machine at the same time, or to use a higher-level tool like OpenShift or Kubernetes to select the SELinux/MCS labels for the containers, so that it can guarantee uniqueness. (See this OpenShift example for more details.)
Our new container runtime tools CRI-O, Buildah, and podman all share the same database. You can run all three of them at the same time on the same host and not have to worry about conflicts.
SELinux provides great filesystem separation for your container runtimes, but you need to be careful when running multiple container runtimes on the same machine at the same time, and also careful to clean up any content left on a host when you remove a container.