One of the main goals at both Red Hat and Docker is to make this statement less true. My team at Red Hat is continuing to try to take advantage of other security mechanisms to make containers more secure. These are a few of the security features we are working on implementing and how they might affect Docker and containers in the future.
User namespace is a kernel namespace that should allow us to get better separation between the host and the container.
The basic idea is that you could create a range of UIDs (for example, 60,000-61,000) that you can map into the user namespace as 0-1000. You can also do this with GIDs. The kernel would treat UID 0 inside of the container as UID 60,000 outside of the container. Any UID on a file or a process that is not in the mapped range would be treated as UID=-1 and not be accessible in the container. This includes the entire base image. If you wanted to use a base image with a user namespace, you would need to change all of the UIDs within the base image to the new root user. Another problem with user namespaces is that volumes mounted into the container with files owned by UID 0 would not be accessible within the container. You would have to chown all of the content you want in the container to something owned by the UID range:
chown -R 60000:60000 /var/lib/content
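The mapping arithmetic described above can be sketched in a few lines of Python. This is only an illustrative model of how a mapped range translates, not kernel code. The `IdMap` type and the `to_host` and `to_container` helpers are hypothetical names, and the range size of 1,000 IDs is an assumption taken from the example. Unmapped IDs come back as `None`, standing in for the "UID=-1" case.

```python
from typing import NamedTuple, Optional, Sequence


class IdMap(NamedTuple):
    """One mapping entry: `count` IDs starting at `inside` map to `outside`."""
    inside: int
    outside: int
    count: int


def to_host(maps: Sequence[IdMap], container_id: int) -> Optional[int]:
    """Translate an ID seen inside the container to the ID the host sees."""
    for m in maps:
        if m.inside <= container_id < m.inside + m.count:
            return m.outside + (container_id - m.inside)
    return None  # not mapped


def to_container(maps: Sequence[IdMap], host_id: int) -> Optional[int]:
    """Translate a host ID to the ID the container sees, if it is mapped."""
    for m in maps:
        if m.outside <= host_id < m.outside + m.count:
            return m.inside + (host_id - m.outside)
    return None  # not mapped, so not accessible in the container


# Assumed example range: container 0-999 maps to host 60000-60999.
EXAMPLE = [IdMap(inside=0, outside=60000, count=1000)]

print(to_host(EXAMPLE, 0))       # container root -> 60000
print(to_container(EXAMPLE, 0))  # host root is unmapped -> None
```

The second call shows why a volume owned by host UID 0 is a problem: host root has no counterpart inside the mapped range.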
Another problem with user namespaces is that if you wanted to use them for separation between containers, you need to have a different range of UIDs for each container. If you had hundreds of containers, you would need hundreds of ranges. This also becomes a problem when container hosts share storage, since the ranges would have to agree across hosts.
One of the cool things about user namespaces is that they allow the use of namespaced capabilities. If you put a container inside of a user namespace, it no longer needs real system capabilities. This means we could adjust the code to drop all system capabilities when a container starts a user namespace. It also allows us to drop all capabilities from the SELinux label.
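To make the bookkeeping concrete, here is a minimal sketch of handing out one non-overlapping range per container. The `container_range` helper, the base of 100,000, and the range size of 65,536 are all assumptions chosen for illustration; nothing here reflects how Docker actually allocates ranges.

```python
from typing import Tuple


def container_range(index: int, base: int = 100_000, size: int = 65_536) -> Tuple[int, int]:
    """Return the (first, last) host UID reserved for container number `index`.

    `base` and `size` are hypothetical defaults, not Docker settings.
    """
    if index < 0:
        raise ValueError("container index must be non-negative")
    first = base + index * size
    return first, first + size - 1


# Each container gets the block immediately after the previous one.
for i in range(3):
    print(i, container_range(i))
```

Every host that shares storage would need to agree on the same index for the same container, which is where the scheme gets painful.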
I see at least three different potential use cases for user namespaces.
- Improve general separation between containers, in such a way that we could turn off all Linux capabilities for a container. Doing so would tighten up security of the system from containers, but would not necessarily improve separation between containers. In this mode, I would envision we would pick one UID for DOCKERROOT, then set up all containers to use this. For example, if DOCKERROOT was UID=2, I would set up a mapping for UID0=2 and GID0=2, and then map all UIDs greater than two to themselves. For example, 3-MAX_UID=3-MAX_UID, and we would do this similarly for GID. By doing this, we have eliminated the ability for a container to attack root. It is also a lot simpler for volume mounts.
I have suggested that maybe we could try to just use the capability drop within the user namespace by default, for example, by mapping UID 0-65,000 to UID 0-65,000. Then, if you volume mounted a file owned by root into the container, it would work, but the processes in the container would not have any capabilities outside of it. By doing this, we can experiment with using user namespaces in a sane way.
- The OpenShift method: all files within a container get mapped to a single UID/GID pair. Every user on the system gets a different UID. The main reason to use a user namespace here would be when processes in the user's container need to run with a kernel capability. Otherwise, the use of user namespace adds little.
- Each container gets a separate UID range mapping from every other container. This gives you the ability to run lots of containers and use UID separation to keep containers apart, but the complexity skyrockets. Volume mounts become a huge headache. In order to make this work, I would recommend we add -v /SRC/DEST:U, which would chown UID:GID /SRC during the mount to the default UID for the container.
I am not, however, suggesting that these three use cases can be used together. I have seen proposals to the kernel to allow "remapping of UIDs" within a mount point when joining a container, even possibly with bind mounts, but I will leave this to the kernel guys, to see if it is possible, and listen to the security guys about whether or not this is a good idea.
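As a rough illustration of the first use case, the kernel accepts mappings as lines of the form "ID-inside ID-outside length" written to /proc/PID/uid_map. The sketch below only builds those lines for the DOCKERROOT idea; it does not write them anywhere. The `dockerroot_map` helper and the `max_id` default of 65,535 are assumptions for the example, since the text leaves MAX_UID unspecified.

```python
from typing import List


def dockerroot_map(docker_root: int = 2, max_id: int = 65_535) -> List[str]:
    """Build uid_map-style lines: container 0 -> docker_root, higher IDs -> themselves.

    Hypothetical helper; `max_id` stands in for MAX_UID.
    """
    if not 0 < docker_root < max_id:
        raise ValueError("docker_root must be between 1 and max_id - 1")
    first = docker_root + 1
    return [
        f"0 {docker_root} 1",                       # container root is DOCKERROOT on the host
        f"{first} {first} {max_id - first + 1}",    # everything above maps to itself
    ]


print("\n".join(dockerroot_map()))
# 0 2 1
# 3 3 65533
```

Note that host UIDs 0 and 1 do not appear on the right-hand side of any line, so host root is simply not reachable from inside the namespace. The same lines would be used for gid_map.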
User namespace has been merged into libcontainer at this point, and patches are being prepared to allow it to run in Docker.
One of the problems with all of the container separation modes described here and elsewhere is that they all rely on the kernel for separation. Unlike air-gapped computers, or even virtual machines, the processes within the container can talk directly to the host kernel. If the host kernel has a vulnerability that a container can reach, the container might be able to disable all of the security and break out.
The x86_64 Linux kernel has over 600 system calls, a bug in any one of which could lead to a privilege escalation. Some of the system calls are seldom called, and should be eliminated from access within the container.
Seccomp's filter mode was developed by Google for removing system calls from a process. Google uses it inside of the Chrome browser for the execution of plugins. Since plugins tend to be untrusted content downloaded from the internet, you really want to control the security of the plugins.
Paul Moore, a coworker of mine, decided to make seccomp a lot easier to use by building a C library to simplify the management of the syscall tree. Libseccomp is now used in tools like qemu, systemd, lxc tools and a few other tools.
We have also written a Go binding for libseccomp that we are working to get into libcontainer to drop system calls from containers.
We are proposing the following list of syscalls be dropped from containers: kexec_load, open_by_handle_at, init_module, finit_module, delete_module, iopl, ioperm, swapon, swapoff, sysfs, sysctl, adjtimex, clock_adjtime, lookup_dcookie, perf_event_open, fanotify_init, and kcmp.
We would like to get other suggestions of syscalls to remove. We are also looking to drop all of the old network protocol families allowed in Linux, including: Amateur Radio AX.25 (3), IPX (4), Appletalk (5), Netrom (6), Bridge (7), ATM PVC (8), X.25 (9), Amateur Radio X.25 PLP (11), DECNet (12), NetBEUI (13), Security (14), PF_KEY key management API (15), and all socket families greater than AF_NETLINK (16).
Another effect of putting in a system call filter is that it drops all syscalls for other architectures by default. For example, by default you will not be allowed to call i386 syscalls with a seccomp-enabled container. We would like to make this the default once it gets merged.
Between eliminating the syscalls mentioned above and the other architectures' system calls, we can shrink the attack surface on the kernel by over half.
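The proposed blacklist and address-family restrictions can be modeled as a simple lookup. This is a plain Python model of the policy for illustration, not a real seccomp filter and not the libseccomp API. The names `syscall_allowed` and `socket_family_allowed` are hypothetical, and the family numbers are the Linux values quoted in the text.

```python
# Syscalls proposed for removal, exactly as listed in the text.
DENIED_SYSCALLS = frozenset({
    "kexec_load", "open_by_handle_at", "init_module", "finit_module",
    "delete_module", "iopl", "ioperm", "swapon", "swapoff", "sysfs",
    "sysctl", "adjtimex", "clock_adjtime", "lookup_dcookie",
    "perf_event_open", "fanotify_init", "kcmp",
})

# Old protocol families proposed for removal (Linux address family numbers).
DENIED_FAMILIES = frozenset({3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15})
AF_NETLINK = 16  # everything above this is also dropped in the proposal


def syscall_allowed(name: str) -> bool:
    """True if the syscall is not on the proposed blacklist."""
    return name not in DENIED_SYSCALLS


def socket_family_allowed(family: int) -> bool:
    """True if socket() with this address family would still be permitted."""
    return 0 <= family <= AF_NETLINK and family not in DENIED_FAMILIES


print(syscall_allowed("kexec_load"))  # False
print(socket_family_allowed(2))       # True (AF_INET)
print(socket_family_allowed(17))      # False (above AF_NETLINK)
```

Under this model, the only families left are 0, 1 (Unix sockets), 2 (IPv4), 10 (IPv6), and 16 (netlink).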
Similar to capabilities and SELinux labels, we are also building into Docker the ability to allow or eliminate additional system calls at the command line.
docker run -d --security-opt seccomp:allow:clock_adjtime ntpd
This would allow the syscall back into the container.
docker run -d --security-opt seccomp:deny:getcwd /bin/sh
Similarly, this would eliminate the ability for the container to look at its current working directory. Matt Heon of Red Hat has a short video showing seccomp in action.
We have started with a blacklist of syscalls to be blocked, but for the really adventurous out there, you could start by turning off all system calls and adding some back.
docker run -d --security-opt seccomp:deny:all --security-opt seccomp:allow:getcwd /bin/sh
In reality, you would need a lot more system calls to make this work. Denials of system calls will show up in /var/log/audit/audit.log, just like SELinux errors, or in /var/log/messages, if audit is not running.
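One way to read the proposed options is as a list of edits applied in order to a starting policy. The sketch below models that reading. The `SeccompPolicy` class is hypothetical, and the semantics (in particular, `deny:all` switching to a whitelist) are my assumption about the proposed syntax, not Docker's implementation.

```python
from typing import Iterable


class SeccompPolicy:
    """Toy model of the proposed seccomp:allow / seccomp:deny options."""

    def __init__(self, default_denied: Iterable[str] = ()):
        self.deny_all = False               # True once "deny:all" is seen
        self.denied = set(default_denied)   # blacklist used while deny_all is False
        self.allowed = set()                # whitelist used once deny_all is True

    def apply(self, option: str) -> None:
        """Apply one option string such as 'seccomp:allow:getcwd'."""
        prefix, action, name = option.split(":", 2)
        if prefix != "seccomp" or action not in ("allow", "deny"):
            raise ValueError(f"unrecognized option: {option}")
        if action == "deny" and name == "all":
            self.deny_all = True
            self.allowed.clear()
        elif action == "allow":
            self.allowed.add(name)
            self.denied.discard(name)
        else:
            self.denied.add(name)
            self.allowed.discard(name)

    def permits(self, syscall: str) -> bool:
        if self.deny_all:
            return syscall in self.allowed
        return syscall not in self.denied


policy = SeccompPolicy()
for opt in ("seccomp:deny:all", "seccomp:allow:getcwd"):
    policy.apply(opt)
print(policy.permits("getcwd"), policy.permits("read"))  # True False
```

The last line shows why the whitelist example would not get far in practice: even `read` is blocked until it is added back.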
Docker in the future
We will continue to look into other security features we can add. If new security features show up in the Linux kernel or are improved, we want to be able to take advantage of these in containers.
One other area we have started looking at is the administration of containers. Currently, if you can talk to Docker's socket or port on the network, you can do anything you want. Sadly, this means you can easily subvert the security of the system, which is why we have turned off access to /run/docker.sock for non-root users. We are beginning to look at adding authentication, so an admin can prove that he is a particular user. We are also looking into adding proper logging, so that we can record in syslog or the journal which admin ran a container with privileges. Finally, we want to add Role Based Access Control (RBAC), so an admin could control what other admins can do. For example:
- Admin 1 is only allowed to start/stop the following containers.
- Admin 2 is allowed to create a non privileged container on image foobar.
- Admin 3 is allowed to run super privileged containers.
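The three example roles could be expressed as a small rule table. This is purely a sketch of the idea, since none of it exists in Docker yet. The `Rule` type, the `is_authorized` function, the action names, and the container names "web" and "db" are all hypothetical placeholders (the text does not say which containers Admin 1 may manage).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    actions: frozenset                    # operations the admin may perform
    containers: frozenset = frozenset()   # empty means "any container"
    images: frozenset = frozenset()       # empty means "any image"
    allow_privileged: bool = False


# Hypothetical policy mirroring the three examples above.
RULES = {
    "admin1": Rule(actions=frozenset({"start", "stop"}),
                   containers=frozenset({"web", "db"})),   # placeholder names
    "admin2": Rule(actions=frozenset({"create"}),
                   images=frozenset({"foobar"})),
    "admin3": Rule(actions=frozenset({"create", "start", "stop", "run"}),
                   allow_privileged=True),
}


def is_authorized(admin, action, container=None, image=None, privileged=False):
    """Return True if the admin's rule permits this request."""
    rule = RULES.get(admin)
    if rule is None or action not in rule.actions:
        return False
    if privileged and not rule.allow_privileged:
        return False
    if rule.containers and container not in rule.containers:
        return False
    if rule.images and image not in rule.images:
        return False
    return True


print(is_authorized("admin1", "stop", container="web"))                    # True
print(is_authorized("admin2", "create", image="foobar", privileged=True))  # False
```

Each decision would also be a natural place to emit the log record mentioned above.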
When these security features are fully implemented, Docker containers will be even more resistant to security risks on the host system. The goal is to always improve the ability for containers to contain.