Running Kubernetes clusters in production is a big undertaking with many moving parts, and keeping an eye on all of them is no easy task. To make matters worse, Kubernetes is highly distributed and often self-healing. If something goes wrong in your cluster, the problem may be intermittent enough (or specific enough) that you won't notice the breakage for quite a while. During that time, of course, your customers or developers may have a degraded or completely broken experience.
Some examples of sneaky things that may go unnoticed for long periods:
- CNI agent crashes
- Nodes blocked from the API server due to rate limiting caused by a noisy pod hosted on them
- Kubelet crashes intermittently
- Intermittent Kubernetes API connectivity problems
- Single-pod failures of kube-dns or CoreDNS
Traditional metrics and alerts are not enough to identify these types of failure conditions accurately, yet these issues can cause pods to fail to schedule, rolling updates to hang, DNS queries to be answered incorrectly, traffic to be load-balanced improperly, and more. Clearly, there is a need for an additional source of monitoring that looks deep into the functionality of Kubernetes to present a clear picture of the health of a cluster.
Fortunately, setting up synthetic monitoring for Kubernetes has been made easier by Kuberhealthy, an open source project created and used by Comcast. Kuberhealthy, among other things, runs a check every 15 minutes on your cluster to ensure that every node can properly deploy and tear down a pod within an acceptable time. This simple test ensures that the cluster scheduler, Kubernetes API, and CNI provisioning functionality are operational.
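The idea behind a deploy-and-teardown check like this can be sketched in a few lines of Python. This is only an illustration of the concept, not Kuberhealthy's actual implementation: `deploy` and `teardown` are hypothetical callables standing in for real Kubernetes API calls, and `max_seconds` is an assumed time budget.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class CheckResult:
    node: str
    ok: bool
    seconds: float
    error: str = ""


def check_node(
    node: str,
    deploy: Callable[[str], None],    # hypothetical: schedules a test pod on `node`
    teardown: Callable[[str], None],  # hypothetical: deletes that test pod
    max_seconds: float,               # assumed time budget for the whole cycle
    clock: Callable[[], float] = time.monotonic,
) -> CheckResult:
    """Run one deploy/teardown cycle on a node and time it."""
    start = clock()
    try:
        deploy(node)
        teardown(node)
    except Exception as exc:
        # Any failure to deploy or tear down marks the node as unhealthy.
        return CheckResult(node, False, clock() - start, str(exc))
    elapsed = clock() - start
    # The time limit is checked after the fact; a slow call is not interrupted.
    if elapsed > max_seconds:
        message = f"took {elapsed:.1f}s, limit is {max_seconds:.1f}s"
        return CheckResult(node, False, elapsed, message)
    return CheckResult(node, True, elapsed)


def run_check(
    nodes: Iterable[str],
    deploy: Callable[[str], None],
    teardown: Callable[[str], None],
    max_seconds: float,
) -> List[CheckResult]:
    """Run the cycle once on every node and collect the results."""
    return [check_node(n, deploy, teardown, max_seconds) for n in nodes]
```

In a real setup, a scheduler would call something like `run_check` on a fixed interval and publish the results, so a single node that cannot start pods shows up even when the rest of the cluster looks fine.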
The results of these checks can be easily served up and monitored as Prometheus metrics or by scraping a simple JSON status page served by Kuberhealthy. More setup details and check information are available in Kuberhealthy's README.
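As a rough sketch of the JSON-scraping approach, the function below summarizes a status payload. It assumes the page returns an object with a boolean `OK` field and an `Errors` list of strings; treat those field names as an assumption and confirm them against the README for the version you run. The function takes the response body as a string, so fetching it over HTTP is left to whatever client you already use.

```python
import json
from typing import List, Tuple


def summarize_status(payload: str) -> Tuple[bool, List[str]]:
    """Return (healthy, errors) from a JSON status payload.

    Assumed shape (verify against your Kuberhealthy version):
        {"OK": true, "Errors": ["..."]}
    """
    status = json.loads(payload)
    # A missing "OK" field is treated as unhealthy rather than healthy.
    ok = bool(status.get("OK", False))
    errors = list(status.get("Errors") or [])
    # Only report healthy when the flag is set and no errors are listed.
    return ok and not errors, errors
```

An alerting script could call `summarize_status` on each scrape and page someone whenever the first element of the result is `False`, attaching the error list to the alert.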
Due to the self-healing and distributed nature of Kubernetes and its services, it is possible for many issues in production Kubernetes clusters to go unnoticed and unknown for a long time. By enabling some simple synthetic checking, we stand a much better chance of catching these kinds of ephemeral and limited-scope disturbances in our infrastructure before customers or developers notice.