Running Ceph inside Docker

Ceph is a fully open source distributed object store, network block device, and file system designed for reliability, performance, and scalability from terabytes to exabytes. Ceph uses a novel placement algorithm (CRUSH), active storage nodes, and peer-to-peer gossip protocols to avoid the scalability and reliability problems associated with centralized controllers and lookup tables. Ceph is part of a tremendous and growing ecosystem, with integrations in virtualization platforms (Proxmox), cloud platforms (OpenStack, CloudStack, OpenNebula), containers (Docker), and big data (Hadoop, as a replacement for HDFS).

Almost two years have passed since my first attempt to run Ceph inside Docker. I hadn't really had the time to resume this work until recently. For the last couple of months, I have been devoting some of my time to contributing to the effort to deploy Ceph in Docker.

(Before we start, I would like to highlight that none of this work would have been possible without the help of Seán C. McCord. Indeed, the current ceph-docker repository is based on Seán's initial work.)

Now let's dive in and see how you can get this running!

Rationale

Running Ceph inside Docker is a bit controversial, and many people might believe that there is no point in doing this. While it's not really a problem for the monitors, the metadata server, and the RADOS gateway to be containerized, things get tricky when it comes to the OSDs (object storage daemons). The Ceph OSD is optimized for the machine it runs on and has a strong relationship with the hardware. The OSD cannot work if the disk that it relies on dies, and this is a bit of an issue in the container world.

To be honest, at one point I found myself thinking:

I don't know why I'm doing this. I just know that people out there want it (and yes, they probably don't know why either). I think it's important to try anyway, so let's do it.

This does not sound really optimistic, I know, but it's the truth. My view has changed slightly though, so for what it's worth, let me explain why. Maybe it will change your mind as well. (And yes, my explanation will be more than: Docker is fancy, so let's Dockerize everything.)

People have started investing a lot of engineering effort into running containerized software on their platforms, and they have been using various tools to build and orchestrate their environments. I wouldn't be surprised to see Kubernetes become the orchestration tool for this matter. Some people also love to run bleeding-edge technologies in production, as they might find other things boring. With the containerize-everything approach, they will be happy to see something happening around their favorite open source storage solution.

Unlike with yum or apt-get, where rolling back is not always easy, containers make upgrades and rollbacks easier: you can simply use docker stop and docker run to roll out a new version of your daemons. You can also potentially run different clusters in an isolated fashion on the same machine. This is ideal for development.
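Here is a rough sketch of what such an upgrade looks like with the ceph/daemon image introduced below (the container ID and daemon name are placeholders; because the configuration and data live in host-mounted volumes, the new container picks up the existing state):

$ sudo docker stop <daemon-container-id>
$ sudo docker pull ceph/daemon
$ sudo docker run -d --net=host \
-v /etc/ceph:/etc/ceph \
-v /var/lib/ceph/:/var/lib/ceph \
ceph/daemon <daemon>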

The project

As mentioned, everything started from the work of Seán C. McCord, and we iterated around his work together. Currently, if you use ceph-docker, you will be able to run every single Ceph daemon either on Ubuntu or CentOS. We have a lot of images available on the Docker Hub. We use the Ceph namespace, so our images are prefixed as ceph/<daemon>. We use automated builds; as a result, every time we merge a new patch, a new build gets triggered and produces a new version of the container image.

As we are currently in a refactoring process, you will see that a lot of images are available. Historically, we had (and we still do, until we merge this patch) one image per daemon: one container image each for monitor, osd, mds, and radosgw. This is not really ideal and, in practice, not needed, which is why we worked on a single container image called daemon. This image contains all the Ceph daemons, and you activate the one you want with a parameter when invoking the docker run command. That being said, if you want to get started, I encourage you to use the ceph/daemon image directly. I'll show an example of how to run it in the next section.
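If you simply want to give it a try first, you can pull the image from the Docker Hub:

$ sudo docker pull ceph/daemon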

Containerize Ceph

Monitors

Given that monitors cannot communicate through a NATed network, we need to use --net=host to expose the Docker host's network stack:

$ sudo docker run -d --net=host \
-v /etc/ceph:/etc/ceph \
-v /var/lib/ceph/:/var/lib/ceph \
-e MON_IP=192.168.0.20 \
-e CEPH_PUBLIC_NETWORK=192.168.0.0/24 \
ceph/daemon mon

Here are the options available to you.

  • MON_IP is the IP address of your host running Docker.
  • MON_NAME is the name of your monitor (DEFAULT: $(hostname)).
  • CEPH_PUBLIC_NETWORK is the CIDR of the host running Docker. It should be in the same network as the MON_IP.
  • CEPH_CLUSTER_NETWORK is the CIDR of a secondary interface of the host running Docker. Used for the OSD replication traffic.
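Once the monitor is running, you can verify that it bootstrapped correctly by querying the cluster status from inside the container (a quick sanity check; replace the placeholder with the container ID reported by docker ps):

$ sudo docker ps
$ sudo docker exec <mon-container-id> ceph -s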

Object Storage Daemon

The current implementation allows you to run a single OSD process per container. Following the microservice mindset, we should not run more than one service inside a container. In our case, running multiple OSD processes in a single container breaks this rule and will likely introduce undesirable behaviors. It would also increase the setup and maintenance complexity of the solution.

In this configuration, the use of --privileged=true is strictly required because we need full access to /dev/ and other kernel functions. However, we also support another configuration based on exposing OSD directories, where the operator does the appropriate preparation of the devices. He or she then simply exposes the OSD directory, and populating the OSD (ceph-osd mkfs) is done by the entry point. The configuration I'm presenting now is easier to start with, because you only need to specify a block device and the entry point does the rest.

Those who do not want to use --privileged=true can fall back on the second example.

$ sudo docker run -d --net=host \
--privileged=true \
-v /etc/ceph:/etc/ceph \
-v /var/lib/ceph/:/var/lib/ceph \
-v /dev/:/dev/ \
-e OSD_DEVICE=/dev/vdd \
ceph/daemon osd_ceph_disk

If you don't want to use --privileged=true you can always prepare the OSD by yourself with the help of your configuration management of your choice.

Here is an example without privileged mode. In this example, we assume that you have partitioned the disk, put a filesystem on it, and mounted the OSD partition. To create your OSDs, simply run the following command:

$ sudo docker exec <mon-container-id> ceph osd create

Then run your container, mounting each prepared OSD directory (for example, -v /osds/1:/var/lib/ceph/osd/ceph-1 -v /osds/2:/var/lib/ceph/osd/ceph-2):

$ sudo docker run -d --net=host \
-v /etc/ceph:/etc/ceph \
-v /var/lib/ceph/:/var/lib/ceph \
-v /osds/1:/var/lib/ceph/osd/ceph-1 \
ceph/daemon osd_disk_directory

Here are the options available to you.

  • OSD_DEVICE is the OSD device, e.g., /dev/sdb
  • OSD_JOURNAL is the device that will be used to store the OSD's journal, e.g., /dev/sdz
  • HOSTNAME is the hostname of the container where the OSD runs (DEFAULT: $(hostname))
  • OSD_FORCE_ZAP will force zapping the content of the given device (DEFAULT: 0; set to 1 to force it)
  • OSD_JOURNAL_SIZE is the size of the OSD journal (DEFAULT: 100)
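To illustrate how these options fit together, here is a variant of the previous command that puts the journal on a second device and force-zaps the data device first (a sketch; the device names are examples, and OSD_FORCE_ZAP=1 destroys any existing data on the device):

$ sudo docker run -d --net=host \
--privileged=true \
-v /etc/ceph:/etc/ceph \
-v /var/lib/ceph/:/var/lib/ceph \
-v /dev/:/dev/ \
-e OSD_DEVICE=/dev/sdb \
-e OSD_JOURNAL=/dev/sdc \
-e OSD_FORCE_ZAP=1 \
ceph/daemon osd_ceph_disk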

Metadata Server

This one is pretty straightforward and easy to bootstrap. The only caveat at the moment is that we require the Ceph admin key to be available inside the container. This key will be used to create the CephFS pools and the filesystem.

If you run an old version of Ceph (prior to 0.87), you don't need this, but you might want to know anyway, since it's always best to run the latest version!

$ sudo docker run -d --net=host \
-v /var/lib/ceph/:/var/lib/ceph \
-v /etc/ceph:/etc/ceph \
-e CEPHFS_CREATE=1 \
ceph/daemon mds

Here are the options available to you.

  • MDS_NAME is the name of the Metadata server (DEFAULT: mds-$(hostname)).
  • CEPHFS_CREATE will create a filesystem for your Metadata server (DEFAULT: 0; set to 1 to enable it).
  • CEPHFS_NAME is the name of the Metadata filesystem (DEFAULT: cephfs).
  • CEPHFS_DATA_POOL is the name of the data pool for the Metadata Server (DEFAULT: cephfs_data).
  • CEPHFS_DATA_POOL_PG is the number of placement groups for the data pool (DEFAULT: 8).
  • CEPHFS_METADATA_POOL is the name of the metadata pool for the Metadata Server (DEFAULT: cephfs_metadata).
  • CEPHFS_METADATA_POOL_PG is the number of placement groups for the metadata pool (DEFAULT: 8).
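As an illustration, here is how these options combine if you want a filesystem with custom pool names (the names are only examples):

$ sudo docker run -d --net=host \
-v /var/lib/ceph/:/var/lib/ceph \
-v /etc/ceph:/etc/ceph \
-e CEPHFS_CREATE=1 \
-e CEPHFS_NAME=myfs \
-e CEPHFS_DATA_POOL=myfs_data \
-e CEPHFS_METADATA_POOL=myfs_metadata \
ceph/daemon mds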

RADOS gateway

For the RADOS gateway, we deploy it with civetweb enabled by default. However, it is possible to use different CGI frontends by simply giving a remote address and port.

$ sudo docker run -d --net=host \
-v /var/lib/ceph/:/var/lib/ceph \
-v /etc/ceph:/etc/ceph \
ceph/daemon rgw

Here are the options available to you.

  • RGW_REMOTE_CGI defines whether you use the embedded web server of the RADOS gateway (DEFAULT: 0; set to 1 to disable it).
  • RGW_REMOTE_CGI_HOST is the remote host running a CGI process.
  • RGW_REMOTE_CGI_PORT is the remote port of the host running a CGI process.
  • RGW_CIVETWEB_PORT is the listening port of civetweb (DEFAULT: 80).
  • RGW_NAME is the name of the RADOS gateway instance (DEFAULT: $(hostname)).
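For example, to run civetweb on a non-default port and quickly check that the gateway responds (the port is arbitrary; an anonymous request should return the S3 ListAllMyBuckets XML):

$ sudo docker run -d --net=host \
-v /var/lib/ceph/:/var/lib/ceph \
-v /etc/ceph:/etc/ceph \
-e RGW_CIVETWEB_PORT=8080 \
ceph/daemon rgw
$ curl http://localhost:8080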

Further work

Configuration store backends

By default, the ceph.conf and all the Ceph keys are generated during the initial monitor bootstrap. This process implies that, to extend your cluster to multiple nodes, you have to distribute these configurations across all the nodes. This is not really flexible, and we want to improve it. One thing that I will propose soon is to use Ansible to generate the configuration and keys and to distribute them to all the machines.

Alternatively, we want to be able to store various configuration files on different backends like etcd and consul.

Orchestrating the deployment

The very first step is to use ceph-ansible, where the logic is already implemented. I still need to push some changes, but most of the work is already present. For Kubernetes, a preview on how to bootstrap monitors is already available.

Extending to Rocket and beyond

There's not much to do here, as you can simply port your Docker images into Rocket and launch them (pun intended).

Want to learn more? A video demonstration of the process is available below.

Sebastien Han currently works as a Senior Cloud Architect for Red Hat. He has been involved with OpenStack and Ceph since 2011 and has built strong expertise around these two technologies. Curious and passionate, he loves working on bleeding-edge technologies and always hopes to find a suitable spot to integrate his two favorite technologies.


This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.