Test arbitrary pod failures on Kubernetes with kube-monkey

Image by:

Opensource.com

I have covered multiple chaos engineering tools in this series. The first article in this series explained what chaos engineering is; the second demonstrated how to get your system's steady state so that you can compare it against a chaos state; the third showed how to use Litmus to test arbitrary failures and experiments in your Kubernetes cluster; and the fourth article got into Chaos Mesh, an open source chaos orchestrator with a web user interface.

In this fifth article, I want to talk about arbitrary pod failure. Kube-monkey offers an easy way to stress-test your systems by scheduling random termination pods in your cluster. This aims to encourage and validate the development of failure-resilient services. As in the previous walkthroughs, I'll use Pop!_OS 20.04, Helm 3, Minikube 1.14.2, and Kubernetes 1.19.

Configure Minikube

If you haven't already, install Minikube in whatever way makes sense for your environment. If you have enough resources, I recommend giving your virtual machine a bit more than the default memory and CPU power:

$ minikube config set memory 8192
❗  These changes will take effect upon a minikube delete and then a minikube start
$ minikube config set cpus 6
❗  These changes will take effect upon a minikube delete and then a minikube start

Then start and check the status of your system:

$ minikube start
?  minikube v1.14.2 on Debian bullseye/sid
?  minikube 1.19.0 is available! Download it: https://github.com/kubernetes/minikube/releases/tag/v1.19.0
?  To disable this notice, run: 'minikube config set WantUpdateNotification false'

✨  Using the docker driver based on user configuration
?  Starting control plane node minikube in cluster minikube
?  Creating docker container (CPUs=6, Memory=8192MB) ...
?  Preparing Kubernetes v1.19.0 on Docker 19.03.8 ...
?  Verifying Kubernetes components...
?  Enabled addons: storage-provisioner, default-storageclass
?  Done! kubectl is now configured to use "minikube" by default
$ minikube status
minikube
type: Control Plane
host: Running
kubelet: Running
apiserver: Running
kubeconfig: Configured

Preconfiguring with deployments

Start by adding some small deployments to run chaos against. These deployments will need some special labels, so you need to create a new Helm chart. The following labels will help kube-monkey determine what to kill if the app is opted-in to doing chaos and understand what details are behind the chaos:

kube-monkey/enabled: This setting opts you in to starting the chaos.
kube-monkey/mtbf: This stands for mean time between failure (in days). For example, if it's set to 3, the Kubernetes (K8s) app expects to have a pod killed approximately every third weekday.
kube-monkey/identifier: This is a unique identifier for the K8s apps; in this example, it will be "nginx."
kube-monkey/kill-mode: The kube-monkey's default behavior is to kill only one pod in the cluster, but you can change it to add more:
- kill-all: Kill every pod, no matter what is happening with a pod
- fixed: Pick a number of pods you want to kill
- fixed-percent: Kill a fixed percent of pods (e.g., 50%)
kube-monkey/kill-value: This is where you can specify a value for kill-mode
- fixed: The number of pods to kill
- random-max-percent: The maximum number from 0–100 that kube-monkey can kill
- fixed-percent: The percentage, from 0–100 percent, of pods to kill

Now that you have this background info, you can start creating a basic Helm chart.

I named this Helm chart nginx. I'll show only the changes to the Helm chart deployment labels below. You need to change the deployment YAML file, which is nginx/templates in this example:

$ /chaos/kube-monkey/helm/nginx/templates$ ls -la
total 40
drwxr-xr-x 3 jess jess 4096 May 15 14:46 .
drwxr-xr-x 4 jess jess 4096 May 15 14:46 ..
-rw-r--r-- 1 jess jess 1826 May 15 14:46 deployment.yaml
-rw-r--r-- 1 jess jess 1762 May 15 14:46 _helpers.tpl
-rw-r--r-- 1 jess jess  910 May 15 14:46 hpa.yaml
-rw-r--r-- 1 jess jess 1048 May 15 14:46 ingress.yaml
-rw-r--r-- 1 jess jess 1735 May 15 14:46 NOTES.txt
-rw-r--r-- 1 jess jess  316 May 15 14:46 serviceaccount.yaml
-rw-r--r-- 1 jess jess  355 May 15 14:46 service.yaml
drwxr-xr-x 2 jess jess 4096 May 15 14:46 tests

In your deployment.yaml file, find this section:

  template:
    metadata:
      {{- with .Values.podAnnotations }}
      annotations:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      labels:
        {{- include "nginx.selectorLabels" . | nindent 8 }}

And make these changes:

  template:
    metadata:
      {{- with .Values.podAnnotations }}
      annotations:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      labels:
        {{- include "nginx.selectorLabels" . | nindent 8 }}
        kube-monkey/enabled: enabled
        kube-monkey/identifier: monkey-victim
        kube-monkey/mtbf: '2'
        kube-monkey/kill-mode: "fixed"
        kube-monkey/kill-value: '1'

Move back one directory and find the values file:

$ /chaos/kube-monkey/helm/nginx/templates$ cd ../
$ /chaos/kube-monkey/helm/nginx$ ls
charts  Chart.yaml  templates  values.yaml

You need to change one line in the values file, from:

replicaCount: 1

to:

replicaCount: 8

This will give you eight different pods to test chaos against.

Move back one more directory and install the new Helm chart:

$ /chaos/kube-monkey/helm/nginx$ cd ../
$ /chaos/kube-monkey/helm$ helm install nginxtest nginx
NAME: nginxtest
LAST DEPLOYED: Sat May 15 14:53:47 2021
NAMESPACE: default
STATUS: deployed
REVISION: 1
NOTES:
1. Get the application URL by running these commands:
  export POD_NAME=$(kubectl get pods --namespace default -l "app.kubernetes.io/name=nginx,app.kubernetes.io/instance=nginxtest" -o jsonpath="{.items[0].metadata.name}")
  export CONTAINER_PORT=$(kubectl get pod --namespace default $POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
  echo "Visit https://127.0.0.1:8080 to use your application"
  kubectl --namespace default port-forward $POD_NAME 8080:$CONTAINER_PORT

Then check the labels in your Nginx pods:

$ /chaos/kube-monkey/helm$ kubectl get pods -n default
NAME                                 READY   STATUS    RESTARTS   AGE
nginxtest-8f967857-88zv7             1/1     Running   0          80s
nginxtest-8f967857-8qb95             1/1     Running   0          80s
nginxtest-8f967857-dlng7             1/1     Running   0          80s
nginxtest-8f967857-h7mmc             1/1     Running   0          80s
nginxtest-8f967857-pdzpq             1/1     Running   0          80s
nginxtest-8f967857-rdpnb             1/1     Running   0          80s
nginxtest-8f967857-rqv2w             1/1     Running   0          80s
nginxtest-8f967857-tr2cn             1/1     Running   0          80s

Chose the first pod to describe and confirm the labels are in place:

$ /chaos/kube-monkey/helm$ kubectl describe pod nginxtest-8f967857-88zv7 -n default
Name:         nginxtest-8f967857-88zv7
Namespace:    default
Priority:     0
Node:         minikube/192.168.49.2
Start Time:   Sat, 15 May 2021 15:11:37 -0400
Labels:       app.kubernetes.io/instance=nginxtest
              app.kubernetes.io/name=nginx
              kube-monkey/enabled=enabled
              kube-monkey/identifier=monkey-victim
              kube-monkey/kill-mode=fixed
              kube-monkey/kill-value=1
              kube-monkey/mtbf=2
              pod-template-hash=8f967857

Configure and install kube-monkey

To install kube-monkey using Helm, you first need to run git clone on the kube-monkey repository:

$ /chaos$ git clone https://github.com/asobti/kube-monkey
Cloning into 'kube-monkey'...
remote: Enumerating objects: 14641, done.
remote: Counting objects: 100% (47/47), done.
remote: Compressing objects: 100% (36/36), done.
remote: Total 14641 (delta 18), reused 22 (delta 8), pack-reused 14594
Receiving objects: 100% (14641/14641), 30.56 MiB | 39.31 MiB/s, done.
Resolving deltas: 100% (6502/6502), done.

Change to the kube-monkey/helm directory:

$ /chaos$ cd kube-monkey/helm/
$ /chaos/kube-monkey/helm$

Then go into the Helm chart and find the values.yaml file:

$ /chaos/kube-monkey/helm$ cd kubemonkey/
$ /chaos/kube-monkey/helm/kubemonkey$ ls
Chart.yaml  README.md  templates  values.yaml

Below, I will show just the sections of the values.yaml file you need to change. They disable dry-run mode by changing it in the config section to false, then add the default namespace to the whitelist so that it can kill the pods you deployed. You must keep the blacklistedNamespaces value or you will cause severe damage to your system.

Change this:

config:
  dryRun: true  
  runHour: 8
  startHour: 10
  endHour: 16
  blacklistedNamespaces:
    - kube-system
  whitelistedNamespaces:  []

To this:

config:
  dryRun: false  
  runHour: 8
  startHour: 10
  endHour: 16
  blacklistedNamespaces:
    - kube-system
  whitelistedNamespaces:  ["default"]

In the debug section, set enabled and schedule_immediate_kill to true. This will show the pods being killed.

Change this:

  debug:
   enabled: false 
   schedule_immediate_kill: false

To this:

  debug:
   enabled: true 
   schedule_immediate_kill: true

Run a helm install:

$ /chaos/kube-monkey/helm$ helm install chaos kubemonkey
NAME: chaos
LAST DEPLOYED: Sat May 15 13:51:59 2021
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
1. Wait until the application is rolled out:
  kubectl -n default rollout status deployment chaos-kube-monkey
2. Check the logs:
  kubectl logs -f deployment.apps/chaos-kube-monkey -n default

Check the kube-monkey logs and see that the pods are being terminated:

 $ /chaos/kube-monkey/helm$ kubectl logs -f deployment.apps/chaos-kube-monkey -n default

	********** Today's schedule **********
	k8 Api Kind	Kind Name		Termination Time
	-----------	---------		----------------
	v1.Deployment	nginxtest		05/15/2021 15:15:22 -0400 EDT
	********** End of schedule **********
I0515 19:15:22.343202       1 kubemonkey.go:70] Termination successfully executed for v1.Deployment nginxtest
I0515 19:15:22.343216       1 kubemonkey.go:73] Status Update: 0 scheduled terminations left.
I0515 19:15:22.343220       1 kubemonkey.go:76] Status Update: All terminations done.
I0515 19:15:22.343278       1 kubemonkey.go:19] Debug mode detected!
I0515 19:15:22.343283       1 kubemonkey.go:20] Status Update: Generating next schedule in 30 sec

You can also use K9s and watch the pods die.

Image by:

^{(Jess Cherry, CC BY-SA 4.0)}

Congratulations! You now have a running chaos test with arbitrary failures. Anytime you want, you can change your applications to test at a certain day of the week and time of day.

Final thoughts

While kube-monkey is a great chaos engineering tool, it does require heavy configurations. Therefore, it isn't the best starter chaos engineering tool for someone new to Kubernetes. Another drawback is you have to edit your application's Helm chart for chaos testing to run.

This tool would be best positioned in a staging environment to watch how applications respond to arbitrary failure regularly. This gives you a long-term way to keep track of unsteady states using cluster monitoring tools. It also keeps notes that you can use for recovery of your internal applications in production.