Test Kubernetes cluster failures and experiments in your terminal

Litmus is an effective tool to cause chaos to test how your system will respond to failure.

Do you know how your system will respond to an arbitrary failure? Will your application fail? Will anything survive after a loss? If you're not sure, it's time to see if your system passes the Litmus test, a detailed way to cause chaos at random with many experiments.

In the first article in this series, I explained what chaos engineering is, and in the second article, I demonstrated how to get your system's steady state so that you can compare it against a chaos state. This third article will show you how to install and use Litmus to test arbitrary failures and experiments in your Kubernetes cluster. In this walkthrough, I'll use Pop!_OS 20.04, Helm 3, Minikube 1.14.2, and Kubernetes 1.19.

Configure Minikube

If you haven't already, install Minikube in whatever way makes sense for your environment. If you have enough resources, I recommend giving your virtual machine a bit more than the default memory and CPU power:

$ minikube config set memory 8192
❗  These changes will take effect upon a minikube delete and then a minikube start
$ minikube config set cpus 6
❗  These changes will take effect upon a minikube delete and then a minikube start

Then start and check your system's status:

$ minikube start
?  minikube v1.14.2 on Debian bullseye/sid
?  minikube 1.19.0 is available! Download it: https://github.com/kubernetes/minikube/releases/tag/v1.19.0
?  To disable this notice, run: 'minikube config set WantUpdateNotification false'

✨  Using the docker driver based on user configuration
?  Starting control plane node minikube in cluster minikube
?  Creating docker container (CPUs=6, Memory=8192MB) ...
?  Preparing Kubernetes v1.19.0 on Docker 19.03.8 ...
?  Verifying Kubernetes components...
?  Enabled addons: storage-provisioner, default-storageclass
?  Done! kubectl is now configured to use "minikube" by default
jess@Athena:~$ minikube status
minikube
type: Control Plane
host: Running
kubelet: Running
apiserver: Running
kubeconfig: Configured

Install Litmus

As outlined on Litmus' homepage, the steps to install Litmus are: add your repo to Helm, create your Litmus namespace, then install your chart:

$ helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
"litmuschaos" has been added to your repositories

$ kubectl create ns litmus
namespace/litmus created

$ helm install chaos litmuschaos/litmus --namespace=litmus
NAME: chaos
LAST DEPLOYED: Sun May  9 17:05:36 2021
NAMESPACE: litmus
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:

Verify the installation

You can run the following commands if you want to verify all the desired components are installed correctly.

Check if api-resources for chaos are available:

root@demo:~# kubectl api-resources | grep litmus 
chaosengines                                   litmuschaos.io                 true         ChaosEngine
chaosexperiments                               litmuschaos.io                 true         ChaosExperiment
chaosresults                                   litmuschaos.io                 true         ChaosResult

Check if the Litmus chaos operator deployment is running successfully:

root@demo:~# kubectl get pods -n litmus 
NAME                      READY   STATUS    RESTARTS   AGE
litmus-7d998b6568-nnlcd   1/1     Running   0          106s

Start running chaos experiments

With this out of the way, you are good to go! Refer to Litmus' chaos experiment documentation to start executing your first experiment.

To confirm your installation is working, check that the pod is up and running correctly:

jess@Athena:~$ kubectl get pods -n litmus
NAME                      READY   STATUS    RESTARTS   AGE
litmus-7d6f994d88-2g7wn   1/1     Running   0          115s

Confirm the Custom Resource Definitions (CRDs) are also installed correctly:

jess@Athena:~$ kubectl get crds | grep chaos
chaosengines.litmuschaos.io       2021-05-09T21:05:33Z
chaosexperiments.litmuschaos.io   2021-05-09T21:05:33Z
chaosresults.litmuschaos.io       2021-05-09T21:05:33Z

Finally, confirm your API resources are also installed:

jess@Athena:~$ kubectl api-resources | grep chaos
chaosengines                                   litmuschaos.io                 true         ChaosEngine
chaosexperiments                               litmuschaos.io                 true         ChaosExperiment
chaosresults                                   litmuschaos.io                 true         ChaosResult

That's what I call easy installation and confirmation. The next step is setting up deployments for chaos.

Prep for destruction

To test for chaos, you need something to test against. Add a new namespace:

$ kubectl create namespace more-apps
namespace/more-apps created

Then add a deployment to the new namespace:

$ kubectl create deployment ghost --namespace more-apps --image=ghost:3.11.0-alpine
deployment.apps/ghost created

Finally, scale your deployment up so that you have more than one pod in your deployment to test against:

$ kubectl scale deployment/ghost --namespace more-apps --replicas=4
deployment.apps/ghost scaled

For Litmus to cause chaos, you need to add an annotation to your deployment to mark it ready for chaos. Currently, annotations are available for deployments, StatefulSets, and DaemonSets. Add the annotation chaos=true to your deployment:

$ kubectl annotate deploy/ghost litmuschaos.io/chaos="true" -n more-apps
deployment.apps/ghost annotated

Make sure the experiments you will install have the correct permissions to work in the "more-apps" namespace.

Make a new rbac.yaml file for the prepper bindings and permissions:

$ touch rbac.yaml

Then add permissions for the generic testing by copying and pasting the code below into your rbac.yaml file. These are just basic, minimal permissions to kill pods in your namespace and give Litmus permissions to delete a pod for a namespace you provide:

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pod-delete-sa
  namespace: more-apps
  labels:
    name: pod-delete-sa
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-delete-sa
  namespace: more-apps
  labels:
    name: pod-delete-sa
rules:
- apiGroups: [""]
  resources: ["pods","events"]
  verbs: ["create","list","get","patch","update","delete","deletecollection"]
- apiGroups: [""]
  resources: ["pods/exec","pods/log","replicationcontrollers"]
  verbs: ["create","list","get"]
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["create","list","get","delete","deletecollection"]
- apiGroups: ["apps"]
  resources: ["deployments","statefulsets","daemonsets","replicasets"]
  verbs: ["list","get"]
- apiGroups: ["apps.openshift.io"]
  resources: ["deploymentconfigs"]
  verbs: ["list","get"]
- apiGroups: ["argoproj.io"]
  resources: ["rollouts"]
  verbs: ["list","get"]
- apiGroups: ["litmuschaos.io"]
  resources: ["chaosengines","chaosexperiments","chaosresults"]
  verbs: ["create","list","get","patch","update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-delete-sa
  namespace: more-apps
  labels:
    name: pod-delete-sa
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-delete-sa
subjects:
- kind: ServiceAccount
  name: pod-delete-sa
  namespace: more-apps

Apply the rbac.yaml file:

$ kubectl apply -f rbac.yaml
serviceaccount/pod-delete-sa created
role.rbac.authorization.k8s.io/pod-delete-sa created
rolebinding.rbac.authorization.k8s.io/pod-delete-sa created

The next step is to prepare your chaos engine to delete pods. The chaos engine will connect the experiment you need to your application instance by creating a chaosengine.yaml file and copying the information below into the .yaml file. This will connect your experiment to your namespace and the service account with the role bindings you created above.

This chaos engine file only specifies the pod to delete during chaos testing:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: moreapps-chaos
  namespace: more-apps
spec:
  appinfo:
    appns: 'more-apps'
    applabel: 'app=ghost'
    appkind: 'deployment'
  # It can be true/false
  annotationCheck: 'true'
  # It can be active/stop
  engineState: 'active'
  #ex. values: ns1:name=percona,ns2:run=more-apps
  auxiliaryAppInfo: ''
  chaosServiceAccount: pod-delete-sa
  # It can be delete/retain
  jobCleanUpPolicy: 'delete'
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            # set chaos duration (in sec) as desired
            - name: TOTAL_CHAOS_DURATION
              value: '30'

            # set chaos interval (in sec) as desired
            - name: CHAOS_INTERVAL
              value: '10'

            # pod failures without '--force' & default terminationGracePeriodSeconds
            - name: FORCE
              value: 'false'

Don't apply this file until you install the experiments in the next section.

Add new experiments for causing chaos

Now that you have an entirely new environment with deployments, roles, and the chaos engine to test against, you need some experiments to run. Since Litmus has a large community, you can find some great experiments in the Chaos Hub.

In this walkthrough, I'll use the generic experiment of killing a pod.

Run a kubectl command to install the generic experiments into your cluster. Install this in your more-apps namespace; you will see the tests created when you run it:

$ kubectl apply -f https://hub.litmuschaos.io/api/chaos/1.13.3?file=charts/generic/experiments.yaml -n more-apps
chaosexperiment.litmuschaos.io/pod-network-duplication created
chaosexperiment.litmuschaos.io/node-cpu-hog created
chaosexperiment.litmuschaos.io/node-drain created
chaosexperiment.litmuschaos.io/docker-service-kill created
chaosexperiment.litmuschaos.io/node-taint created
chaosexperiment.litmuschaos.io/pod-autoscaler created
chaosexperiment.litmuschaos.io/pod-network-loss created
chaosexperiment.litmuschaos.io/node-memory-hog created
chaosexperiment.litmuschaos.io/disk-loss created
chaosexperiment.litmuschaos.io/pod-io-stress created
chaosexperiment.litmuschaos.io/pod-network-corruption created
chaosexperiment.litmuschaos.io/container-kill created
chaosexperiment.litmuschaos.io/node-restart created
chaosexperiment.litmuschaos.io/node-io-stress created
chaosexperiment.litmuschaos.io/disk-fill created
chaosexperiment.litmuschaos.io/pod-cpu-hog created
chaosexperiment.litmuschaos.io/pod-network-latency created
chaosexperiment.litmuschaos.io/kubelet-service-kill created
chaosexperiment.litmuschaos.io/k8-pod-delete created
chaosexperiment.litmuschaos.io/pod-delete created
chaosexperiment.litmuschaos.io/node-poweroff created
chaosexperiment.litmuschaos.io/k8-service-kill created
chaosexperiment.litmuschaos.io/pod-memory-hog created

Verify the experiments installed correctly:

$ kubectl get chaosexperiments -n more-apps
NAME                      AGE
container-kill            72s
disk-fill                 72s
disk-loss                 72s
docker-service-kill       72s
k8-pod-delete             72s
k8-service-kill           72s
kubelet-service-kill      72s
node-cpu-hog              72s
node-drain                72s
node-io-stress            72s
node-memory-hog           72s
node-poweroff             72s
node-restart              72s
node-taint                72s
pod-autoscaler            72s
pod-cpu-hog               72s
pod-delete                72s
pod-io-stress             72s
pod-memory-hog            72s
pod-network-corruption    72s
pod-network-duplication   72s
pod-network-latency       72s
pod-network-loss          72s

Run the experiments

Now that everything is installed and configured, use your chaosengine.yaml file to run the pod-deletion experiment you defined. Apply your chaos engine file:

$ kubectl apply -f chaosengine.yaml
chaosengine.litmuschaos.io/more-apps-chaos created

Confirm the engine started by getting all the pods in your namespace; you should see pod-delete being created:

$ kubectl get pods -n more-apps
NAME                      READY   STATUS              RESTARTS   AGE
ghost-5bdd4cdcc4-blmtl    1/1     Running             0          53m
ghost-5bdd4cdcc4-z2lnt    1/1     Running             0          53m
ghost-5bdd4cdcc4-zlcc9    1/1     Running             0          53m
ghost-5bdd4cdcc4-zrs8f    1/1     Running             0          53m
moreapps-chaos-runner     1/1     Running             0          17s
pod-delete-e443qx-lxzfx   0/1     ContainerCreating   0          7s

Next, you need to be able to observe your experiments using Litmus. The following command uses the ChaosResult CRD and provides a large amount of output:

$ kubectl describe chaosresult moreapps-chaos-pod-delete -n more-apps
Name:         moreapps-chaos-pod-delete
Namespace:    more-apps
Labels:       app.kubernetes.io/component=experiment-job
              app.kubernetes.io/part-of=litmus
              app.kubernetes.io/version=1.13.3
              chaosUID=a6c9ab7e-ff07-4703-abe4-43e03b77bd72
              controller-uid=601b7330-c6f3-4d9b-90cb-2c761ac0567a
              job-name=pod-delete-e443qx
              name=moreapps-chaos-pod-delete
Annotations:  <none>
API Version:  litmuschaos.io/v1alpha1
Kind:         ChaosResult
Metadata:
  Creation Timestamp:  2021-05-09T22:06:19Z
  Generation:          2
  Managed Fields:
    API Version:  litmuschaos.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .:
          f:app.kubernetes.io/component:
          f:app.kubernetes.io/part-of:
          f:app.kubernetes.io/version:
          f:chaosUID:
          f:controller-uid:
          f:job-name:
          f:name:
      f:spec:
        .:
        f:engine:
        f:experiment:
      f:status:
        .:
        f:experimentStatus:
        f:history:
    Manager:         experiments
    Operation:       Update
    Time:            2021-05-09T22:06:53Z
  Resource Version:  8406
  Self Link:         /apis/litmuschaos.io/v1alpha1/namespaces/more-apps/chaosresults/moreapps-chaos-pod-delete
  UID:               08b7e3da-d603-49c7-bac4-3b54eb30aff8
Spec:
  Engine:      moreapps-chaos
  Experiment:  pod-delete
Status:
  Experiment Status:
    Fail Step:                 N/A
    Phase:                     Completed
    Probe Success Percentage:  100
    Verdict:                   Pass
  History:
    Failed Runs:   0
    Passed Runs:   1
    Stopped Runs:  0
Events:
  Type    Reason   Age    From                     Message
  ----    ------   ----   ----                     -------
  Normal  Pass     104s   pod-delete-e443qx-lxzfx  experiment: pod-delete, Result: Pass

You can see the pass or fail output from your testing as you run the chaos engine definitions.

Congratulations on your first (and hopefully not last) chaos engineering test! Now you have a powerful tool to use and help your environment grow.

Final thoughts

You might be thinking, "I can't run this manually every time I want to run chaos. How far can I take this, and how can I set it up for the long term?"

Litmus' best part (aside from the Chaos Hub) is its scheduler function. You can use it to define times and dates, repetitions or sporadic, to run experiments. This is a great tool for detailed admins who have been working with Kubernetes for a while and are ready to create some chaos. I suggest staying up to date on Litmus and how to use this tool for regular chaos engineering. Happy pod hunting!

Get started with Kubernetes using chaos engineering

Learn the basics of chaos engineering in this first article in a series celebrating Kubernetes' 11th birthday.

How Kubernetes became the solution for migrating legacy applications

Start monitoring your Kubernetes cluster with Prometheus and Grafana

Before you can measure chaos, you need to know what your system's steady state looks like. Learn how in the second article in this series about chaos engineering.