Stateful containerized applications with Kubernetes

Image by: Håkan Dahlström. CC BY-SA 4.0

To date, almost all of the talk about containers and microservices has been about "stateless" applications. This is entirely understandable because stateless applications are simply easier. However, containers and orchestration have matured to the point where we need to take on the interesting workloads: the stateful ones. That's why two of my talks at SCALE 15x are about databases, containers, and Kubernetes, which is an open source system for automating deployment, scaling, and management of containerized applications.

Stateless services are applications like web servers, proxies, and application code, which may handle data, but they don't store it. These are easy to think about in an orchestration context because they are simple to deploy and simple to scale. If traffic goes up, you just add more of them and load-balance. More importantly, they are "immutable"; there is very little difference between the upstream container "image" and the running containers in your infrastructure. This means you can also replace them at any time, with little "switching cost" between one container instance and another.

Stateful services are things like routers, CDNs (content delivery networks), streaming media servers, and authentication servers. From the moment of deployment, those containers start to differ from their upstream images, and the longer they exist the more they differ. That accumulated difference is the container's "state." In truth, every running application has at least a little state, but for "stateless" applications that state is small and fast to replace. For stateful ones, it is not. And while state can be synchronized or replicated across stateful nodes, that has to be done outside of the orchestration system itself, by some method specific to the application.

Databases and stateful applications

Of course, given my 18-year history of work on PostgreSQL, the stateful applications I really care about are transactional databases. In addition to being essential to most application stacks, databases are also a great test case for stateful support, because they are stateful in all of the ways it is possible to be stateful, including:

Storage
Identity
Sessions
Cluster role

For example, PostgreSQL needs to store data and transactions in files that are both persistent and exclusive to each PostgreSQL container (Storage). Each container needs to be identifiable as a specific database node, and we need to be able to route traffic to it by name or address (Identity). Database client connections, or sessions, also have a state, and there's a cost to breaking them, so we don't want to move database nodes around casually (Sessions). Finally, each database node has a role in its database cluster, such as "master," "replica," or "shard" (Cluster role). These cluster roles persist until a database-specific event changes them.

Until recently, implementing these kinds of state on popular container cloud stacks has been challenging. Docker and orchestration frameworks treated most types of state as something that happens outside the container stack, forcing the database architect to manage storage, identity, routing, and everything else. The container stack gave you no help in moving your database into containers. As a result, while many web applications became containerized, few databases or other stateful applications did. The answer to, "Where do we store the data?" was generally, "Use Amazon RDS (Relational Database Service)."

Databases using Kubernetes StatefulSet

For the last year, the Kubernetes project has been working on an object and a set of features called StatefulSet to handle databases and other stateful services. Developers originally released this feature under the working name "PetSet," but renamed it to the more appropriate "StatefulSet" in the 1.5 release. At this point, StatefulSet implements both the Storage and Identity stateful qualities. The other two, Sessions and Cluster role, can be implemented with minimal glue code using Kubernetes as a resource. In other words, you don't have to keep waiting to deploy orchestrated, containerized databases. It's time.
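To make the Storage and Identity pieces concrete, here is a minimal sketch that builds a StatefulSet manifest as a plain Python dict and prints it as JSON, which Kubernetes accepts in place of YAML. Several things here are assumptions for illustration, not taken from my demo repo: the `apps/v1` API version (releases from the 1.5 era used a beta API group), the `postgres` image name, the `pgdata` volume name, the 10Gi size, and the default `cluster.local` cluster domain. A matching headless Service would have to be created separately, and a real PostgreSQL container needs more configuration than this.

```python
import json


def postgres_statefulset(name: str = "postgres", replicas: int = 3,
                         storage: str = "10Gi") -> dict:
    """Build a minimal, illustrative StatefulSet manifest as a plain dict."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",  # assumption: a current cluster
        "kind": "StatefulSet",
        "metadata": {"name": name},
        "spec": {
            # Name of the headless Service that gives each pod a stable
            # DNS name (Identity). The Service itself is defined elsewhere.
            "serviceName": name,
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": "postgres",
                        "image": "postgres",  # assumption: illustrative image
                        "ports": [{"containerPort": 5432}],
                        "volumeMounts": [{
                            "name": "pgdata",
                            "mountPath": "/var/lib/postgresql/data",
                        }],
                    }],
                },
            },
            # One PersistentVolumeClaim per pod, exclusive to that pod and
            # kept when the pod is rescheduled (Storage).
            "volumeClaimTemplates": [{
                "metadata": {"name": "pgdata"},
                "spec": {
                    "accessModes": ["ReadWriteOnce"],
                    "resources": {"requests": {"storage": storage}},
                },
            }],
        },
    }


def pod_dns_names(manifest: dict, namespace: str = "default") -> list:
    """Return the stable per-pod DNS names for the pods in the set.

    Pods are named <set-name>-<ordinal>; the cluster domain is assumed
    to be the default, cluster.local.
    """
    name = manifest["metadata"]["name"]
    service = manifest["spec"]["serviceName"]
    return [
        f"{name}-{i}.{service}.{namespace}.svc.cluster.local"
        for i in range(manifest["spec"]["replicas"])
    ]


if __name__ == "__main__":
    manifest = postgres_statefulset()
    print(json.dumps(manifest, indent=2))
    for host in pod_dns_names(manifest):
        print(host)
```

The two commented sections map onto the two qualities StatefulSet covers today: `volumeClaimTemplates` gives each database node its own persistent files, and `serviceName` plus the ordinal pod names give each node an address that survives rescheduling.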

Now, you can run a database in containers under Kubernetes, but why should you? The answer has nothing to do with containers, but everything to do with the benefits of orchestration. One thing we expect out of modern database platforms is high availability (HA), but this is not something that the database software alone can supply. Bringing up clusters, replacing database nodes on failed machines, rerouting application traffic to migrated nodes, and other HA considerations require a great deal of code and many utilities outside of the database as well as inside of it.

This is pretty difficult code because it requires implementing a distributed system. Kubernetes, like other orchestration systems, takes care of that for you, making a distributed system easily available. This means that databases without built-in HA, like PostgreSQL and MySQL, can easily become highly available, and databases that already were HA (like Cassandra and RethinkDB) can become fully automated. Speaking from experience, this is vastly easier than doing it yourself from scratch.
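As a rough illustration of the kind of glue code involved in the Cluster role state, here is a toy sketch of lease-based role assignment: whichever node holds an unexpired lease acts as master, and every other node acts as a replica. Everything in it is hypothetical, including the `LeaseStore` class, the `role_for` helper, and the node names. The store is a single in-process object, so this is not a distributed system and not the code of any particular tool; a real deployment would need a consistent shared store with atomic updates, which is the part the orchestration layer supplies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    holder: str
    expires_at: float


class LeaseStore:
    """In-memory stand-in for a shared, consistent key-value store."""

    def __init__(self) -> None:
        self._lease: Optional[Lease] = None

    def try_acquire(self, node: str, ttl: float, now: float) -> bool:
        """Grant or renew the lease if it is free, expired, or already ours."""
        lease = self._lease
        if lease is None or lease.expires_at <= now or lease.holder == node:
            self._lease = Lease(holder=node, expires_at=now + ttl)
            return True
        return False


def role_for(store: LeaseStore, node: str, ttl: float, now: float) -> str:
    """Return the cluster role a node should take at time `now`."""
    return "master" if store.try_acquire(node, ttl, now) else "replica"


if __name__ == "__main__":
    store = LeaseStore()
    # postgres-0 takes the lease first; postgres-1 becomes a replica.
    print(role_for(store, "postgres-0", ttl=10, now=0))
    print(role_for(store, "postgres-1", ttl=10, now=5))
    # postgres-0 stops renewing (say its machine failed); once the lease
    # expires, postgres-1 is promoted on its next check.
    print(role_for(store, "postgres-1", ttl=10, now=11))
```

Times are passed in explicitly rather than read from the clock so the failover sequence is easy to follow and to test.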

I've created some examples in my atomicdb demonstration repo to show you how to deploy PostgreSQL using StatefulSet. These are meant as illustrations of how to use the features rather than as complete production implementations. Zalando, the lead on the Patroni cluster management project, has released a Helm Chart for its Kubernetes-based clustered PostgreSQL. And, of course, I'll be talking about this at SCALE 15x, not once, but twice.

Talks at SCALE 15x

On Friday at SCALE 15x, in the PostgreSQL track, I'll be talking about how to deploy clustered PostgreSQL on Kubernetes. This will include single-master replicated PostgreSQL using Patroni, sharded databases using CitusDB, and other replication platforms. I'll show that not only can you deploy these using containers and Kubernetes, but that using these methods is the best way to do so.

On Sunday, I'll be explaining the different types of state in more detail, and how we implement these through Kubernetes StatefulSet or outside of it. Of course, none of this is all that useful if you don't know how to use Kubernetes, so attend my Kubernetes 101 talk at the Container Day on Thursday.