Do you know how your system will respond to an arbitrary failure? Will your application fail? Will anything survive after a loss? If you're not sure, it's time to see if your system passes the Litmus test, a detailed way to cause chaos at random with many experiments.
In the first article in this series, I explained what chaos engineering is, and in the second article, I demonstrated how to measure your system's steady state so that you can compare it against a chaos state. This third article will show you how to install and use Litmus to test arbitrary failures and experiments in your Kubernetes cluster. In this walkthrough, I'll use Pop!_OS 20.04, Helm 3, Minikube 1.14.2, and Kubernetes 1.19.
Configure Minikube
If you haven't already, install Minikube in whatever way makes sense for your environment. If you have enough resources, I recommend giving your virtual machine a bit more than the default memory and CPU power:
$ minikube config set memory 8192
❗ These changes will take effect upon a minikube delete and then a minikube start
$ minikube config set cpus 6
❗ These changes will take effect upon a minikube delete and then a minikube start
Then start and check your system's status:
$ minikube start
😄 minikube v1.14.2 on Debian bullseye/sid
🎉 minikube 1.19.0 is available! Download it: https://github.com/kubernetes/minikube/releases/tag/v1.19.0
💡 To disable this notice, run: 'minikube config set WantUpdateNotification false'
✨ Using the docker driver based on user configuration
👍 Starting control plane node minikube in cluster minikube
🔥 Creating docker container (CPUs=6, Memory=8192MB) ...
🐳 Preparing Kubernetes v1.19.0 on Docker 19.03.8 ...
🔎 Verifying Kubernetes components...
🌟 Enabled addons: storage-provisioner, default-storageclass
🏄 Done! kubectl is now configured to use "minikube" by default
jess@Athena:~$ minikube status
minikube
type: Control Plane
host: Running
kubelet: Running
apiserver: Running
kubeconfig: Configured
Install Litmus
As outlined on Litmus' homepage, the steps to install Litmus are: add the Litmus repo to Helm, create a litmus namespace, then install the chart:
$ helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
"litmuschaos" has been added to your repositories
$ kubectl create ns litmus
namespace/litmus created
$ helm install chaos litmuschaos/litmus --namespace=litmus
NAME: chaos
LAST DEPLOYED: Sun May 9 17:05:36 2021
NAMESPACE: litmus
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Verify the installation
You can run the following commands if you want to verify all the desired components are installed correctly.
Check if api-resources for chaos are available:
root@demo:~# kubectl api-resources | grep litmus
chaosengines litmuschaos.io true ChaosEngine
chaosexperiments litmuschaos.io true ChaosExperiment
chaosresults litmuschaos.io true ChaosResult
Check if the Litmus chaos operator deployment is running successfully:
root@demo:~# kubectl get pods -n litmus
NAME READY STATUS RESTARTS AGE
litmus-7d998b6568-nnlcd 1/1 Running 0 106s
Start running chaos experiments
With this out of the way, you are good to go! Refer to Litmus' chaos experiment documentation to start executing your first experiment.
To confirm your installation is working, check that the pod is up and running correctly:
jess@Athena:~$ kubectl get pods -n litmus
NAME READY STATUS RESTARTS AGE
litmus-7d6f994d88-2g7wn 1/1 Running 0 115s
Confirm the Custom Resource Definitions (CRDs) are also installed correctly:
jess@Athena:~$ kubectl get crds | grep chaos
chaosengines.litmuschaos.io 2021-05-09T21:05:33Z
chaosexperiments.litmuschaos.io 2021-05-09T21:05:33Z
chaosresults.litmuschaos.io 2021-05-09T21:05:33Z
Finally, confirm your API resources are also installed:
jess@Athena:~$ kubectl api-resources | grep chaos
chaosengines litmuschaos.io true ChaosEngine
chaosexperiments litmuschaos.io true ChaosExperiment
chaosresults litmuschaos.io true ChaosResult
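If you would rather run one check than read three listings, the confirmation steps above can be sketched as a small shell function. This is only an illustration: check_litmus_crds is a hypothetical helper name, and it assumes kubectl is already pointed at your Minikube cluster.

```shell
#!/bin/sh
# Hypothetical helper (not part of Litmus): confirm that the three Litmus
# CRDs listed above exist. Prints one line per CRD and returns non-zero
# if any of them is missing.
check_litmus_crds() {
  missing=0
  for crd in chaosengines chaosexperiments chaosresults; do
    if kubectl get crd "${crd}.litmuschaos.io" >/dev/null 2>&1; then
      echo "ok: ${crd}.litmuschaos.io"
    else
      echo "missing: ${crd}.litmuschaos.io"
      missing=1
    fi
  done
  return "$missing"
}
```

Run check_litmus_crds after the Helm install; a non-zero exit status means at least one CRD did not register.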
That's what I call easy installation and confirmation. The next step is setting up deployments for chaos.
Prep for destruction
To test for chaos, you need something to test against. Add a new namespace:
$ kubectl create namespace more-apps
namespace/more-apps created
Then add a deployment to the new namespace:
$ kubectl create deployment ghost --namespace more-apps --image=ghost:3.11.0-alpine
deployment.apps/ghost created
Finally, scale your deployment up so that you have more than one pod in your deployment to test against:
$ kubectl scale deployment/ghost --namespace more-apps --replicas=4
deployment.apps/ghost scaled
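Before causing any chaos, it helps to know that all four replicas are actually up. The sketch below waits for the rollout to finish and then prints the ready count. The function name wait_for_ghost and the 120s default timeout are my own assumptions, not something Litmus requires.

```shell
#!/bin/sh
# Hypothetical helper: block until the ghost deployment finishes rolling
# out, then print how many replicas report ready.
# $1 = optional timeout passed to kubectl (assumed default: 120s)
wait_for_ghost() {
  kubectl rollout status deployment/ghost --namespace more-apps \
    --timeout="${1:-120s}" || return 1
  ready=$(kubectl get deployment ghost --namespace more-apps \
    -o jsonpath='{.status.readyReplicas}')
  echo "ready replicas: ${ready:-0}"
}
```

Calling wait_for_ghost with no arguments should end with a ready-replica count that matches the number you scaled to.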
For Litmus to cause chaos, you need to add an annotation to your deployment to mark it ready for chaos. Currently, annotations are available for deployments, StatefulSets, and DaemonSets. Add the annotation litmuschaos.io/chaos="true" to your deployment:
$ kubectl annotate deploy/ghost litmuschaos.io/chaos="true" -n more-apps
deployment.apps/ghost annotated
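To double-check that the annotation landed, you can read it back with a JSONPath query. This is a minimal sketch; chaos_annotation is a hypothetical helper name. Note that the dots inside the annotation key need a backslash so kubectl does not treat them as path separators.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the litmuschaos.io/chaos
# annotation on a deployment (empty output if it is not set).
# $1 = deployment name, $2 = namespace
chaos_annotation() {
  kubectl get deployment "$1" --namespace "$2" \
    -o jsonpath='{.metadata.annotations.litmuschaos\.io/chaos}'
}
```

For the deployment above, chaos_annotation ghost more-apps should print true.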
Make sure the experiments you will install have the correct permissions to work in the more-apps namespace.
Make a new rbac.yaml file for the proper bindings and permissions:
$ touch rbac.yaml
Then add permissions for the generic testing by copying and pasting the code below into your rbac.yaml file. These are basic, minimal permissions that let Litmus delete pods in the namespace you provide:
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pod-delete-sa
  namespace: more-apps
  labels:
    name: pod-delete-sa
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-delete-sa
  namespace: more-apps
  labels:
    name: pod-delete-sa
rules:
- apiGroups: [""]
  resources: ["pods","events"]
  verbs: ["create","list","get","patch","update","delete","deletecollection"]
- apiGroups: [""]
  resources: ["pods/exec","pods/log","replicationcontrollers"]
  verbs: ["create","list","get"]
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["create","list","get","delete","deletecollection"]
- apiGroups: ["apps"]
  resources: ["deployments","statefulsets","daemonsets","replicasets"]
  verbs: ["list","get"]
- apiGroups: ["apps.openshift.io"]
  resources: ["deploymentconfigs"]
  verbs: ["list","get"]
- apiGroups: ["argoproj.io"]
  resources: ["rollouts"]
  verbs: ["list","get"]
- apiGroups: ["litmuschaos.io"]
  resources: ["chaosengines","chaosexperiments","chaosresults"]
  verbs: ["create","list","get","patch","update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-delete-sa
  namespace: more-apps
  labels:
    name: pod-delete-sa
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-delete-sa
subjects:
- kind: ServiceAccount
  name: pod-delete-sa
  namespace: more-apps
Apply the rbac.yaml file:
$ kubectl apply -f rbac.yaml
serviceaccount/pod-delete-sa created
role.rbac.authorization.k8s.io/pod-delete-sa created
rolebinding.rbac.authorization.k8s.io/pod-delete-sa created
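You can confirm that the role binding really grants what the experiment needs by asking the API server on behalf of the service account. The sketch below uses kubectl auth can-i with impersonation. The function name check_chaos_sa and the four verb/resource pairs are my own sample, picked from the Role above, not an official checklist.

```shell
#!/bin/sh
# Hypothetical helper: ask the API server whether the pod-delete-sa
# service account may perform a few of the actions granted by the Role.
# Prints allowed/denied per action; returns non-zero if any is denied.
check_chaos_sa() {
  rc=0
  while read -r verb resource; do
    # </dev/null keeps kubectl from consuming the list being read below
    if kubectl auth can-i "$verb" "$resource" \
        --namespace more-apps \
        --as system:serviceaccount:more-apps:pod-delete-sa \
        </dev/null >/dev/null 2>&1; then
      echo "allowed: $verb $resource"
    else
      echo "denied: $verb $resource"
      rc=1
    fi
  done <<EOF
delete pods
create jobs
list deployments
create chaosresults
EOF
  return "$rc"
}
```

Impersonating a service account requires that your own user is allowed to impersonate, which is normally the case for the default Minikube admin context.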
The next step is to prepare your chaos engine to delete pods. The chaos engine connects the experiment you want to run to your application instance. Create a chaosengine.yaml file and copy the information below into it. This connects your experiment to your namespace and to the service account with the role bindings you created above.
This chaos engine file specifies only the pod-delete experiment:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: moreapps-chaos
  namespace: more-apps
spec:
  appinfo:
    appns: 'more-apps'
    applabel: 'app=ghost'
    appkind: 'deployment'
  # It can be true/false
  annotationCheck: 'true'
  # It can be active/stop
  engineState: 'active'
  #ex. values: ns1:name=percona,ns2:run=more-apps
  auxiliaryAppInfo: ''
  chaosServiceAccount: pod-delete-sa
  # It can be delete/retain
  jobCleanUpPolicy: 'delete'
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            # set chaos duration (in sec) as desired
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            # set chaos interval (in sec) as desired
            - name: CHAOS_INTERVAL
              value: '10'
            # pod failures without '--force' & default terminationGracePeriodSeconds
            - name: FORCE
              value: 'false'
Don't apply this file until you install the experiments in the next section.
Add new experiments for causing chaos
Now that you have an entirely new environment with deployments, roles, and the chaos engine to test against, you need some experiments to run. Since Litmus has a large community, you can find some great experiments in the Chaos Hub.
In this walkthrough, I'll use the generic experiment of killing a pod.
Run a kubectl command to install the generic experiments into your cluster. Install them in your more-apps namespace; you will see the experiments created when you run it (the URL is quoted so that your shell does not try to interpret the ? character):
$ kubectl apply -f "https://hub.litmuschaos.io/api/chaos/1.13.3?file=charts/generic/experiments.yaml" -n more-apps
chaosexperiment.litmuschaos.io/pod-network-duplication created
chaosexperiment.litmuschaos.io/node-cpu-hog created
chaosexperiment.litmuschaos.io/node-drain created
chaosexperiment.litmuschaos.io/docker-service-kill created
chaosexperiment.litmuschaos.io/node-taint created
chaosexperiment.litmuschaos.io/pod-autoscaler created
chaosexperiment.litmuschaos.io/pod-network-loss created
chaosexperiment.litmuschaos.io/node-memory-hog created
chaosexperiment.litmuschaos.io/disk-loss created
chaosexperiment.litmuschaos.io/pod-io-stress created
chaosexperiment.litmuschaos.io/pod-network-corruption created
chaosexperiment.litmuschaos.io/container-kill created
chaosexperiment.litmuschaos.io/node-restart created
chaosexperiment.litmuschaos.io/node-io-stress created
chaosexperiment.litmuschaos.io/disk-fill created
chaosexperiment.litmuschaos.io/pod-cpu-hog created
chaosexperiment.litmuschaos.io/pod-network-latency created
chaosexperiment.litmuschaos.io/kubelet-service-kill created
chaosexperiment.litmuschaos.io/k8-pod-delete created
chaosexperiment.litmuschaos.io/pod-delete created
chaosexperiment.litmuschaos.io/node-poweroff created
chaosexperiment.litmuschaos.io/k8-service-kill created
chaosexperiment.litmuschaos.io/pod-memory-hog created
Verify the experiments installed correctly:
$ kubectl get chaosexperiments -n more-apps
NAME AGE
container-kill 72s
disk-fill 72s
disk-loss 72s
docker-service-kill 72s
k8-pod-delete 72s
k8-service-kill 72s
kubelet-service-kill 72s
node-cpu-hog 72s
node-drain 72s
node-io-stress 72s
node-memory-hog 72s
node-poweroff 72s
node-restart 72s
node-taint 72s
pod-autoscaler 72s
pod-cpu-hog 72s
pod-delete 72s
pod-io-stress 72s
pod-memory-hog 72s
pod-network-corruption 72s
pod-network-duplication 72s
pod-network-latency 72s
pod-network-loss 72s
Run the experiments
Now that everything is installed and configured, use your chaosengine.yaml file to run the pod-deletion experiment you defined. Apply your chaos engine file:
$ kubectl apply -f chaosengine.yaml
chaosengine.litmuschaos.io/moreapps-chaos created
Confirm the engine started by getting all the pods in your namespace; you should see pod-delete being created:
$ kubectl get pods -n more-apps
NAME READY STATUS RESTARTS AGE
ghost-5bdd4cdcc4-blmtl 1/1 Running 0 53m
ghost-5bdd4cdcc4-z2lnt 1/1 Running 0 53m
ghost-5bdd4cdcc4-zlcc9 1/1 Running 0 53m
ghost-5bdd4cdcc4-zrs8f 1/1 Running 0 53m
moreapps-chaos-runner 1/1 Running 0 17s
pod-delete-e443qx-lxzfx 0/1 ContainerCreating 0 7s
Next, you need to be able to observe your experiments using Litmus. The following command uses the ChaosResult CRD and provides a large amount of output:
$ kubectl describe chaosresult moreapps-chaos-pod-delete -n more-apps
Name: moreapps-chaos-pod-delete
Namespace: more-apps
Labels: app.kubernetes.io/component=experiment-job
app.kubernetes.io/part-of=litmus
app.kubernetes.io/version=1.13.3
chaosUID=a6c9ab7e-ff07-4703-abe4-43e03b77bd72
controller-uid=601b7330-c6f3-4d9b-90cb-2c761ac0567a
job-name=pod-delete-e443qx
name=moreapps-chaos-pod-delete
Annotations: <none>
API Version: litmuschaos.io/v1alpha1
Kind: ChaosResult
Metadata:
Creation Timestamp: 2021-05-09T22:06:19Z
Generation: 2
Managed Fields:
API Version: litmuschaos.io/v1alpha1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:labels:
.:
f:app.kubernetes.io/component:
f:app.kubernetes.io/part-of:
f:app.kubernetes.io/version:
f:chaosUID:
f:controller-uid:
f:job-name:
f:name:
f:spec:
.:
f:engine:
f:experiment:
f:status:
.:
f:experimentStatus:
f:history:
Manager: experiments
Operation: Update
Time: 2021-05-09T22:06:53Z
Resource Version: 8406
Self Link: /apis/litmuschaos.io/v1alpha1/namespaces/more-apps/chaosresults/moreapps-chaos-pod-delete
UID: 08b7e3da-d603-49c7-bac4-3b54eb30aff8
Spec:
Engine: moreapps-chaos
Experiment: pod-delete
Status:
Experiment Status:
Fail Step: N/A
Phase: Completed
Probe Success Percentage: 100
Verdict: Pass
History:
Failed Runs: 0
Passed Runs: 1
Stopped Runs: 0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pass 104s pod-delete-e443qx-lxzfx experiment: pod-delete, Result: Pass
You can see the pass or fail output from your testing as you run the chaos engine definitions.
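If you only want the verdict rather than the full description, a JSONPath query against the same ChaosResult is enough. The sketch below polls until the verdict is Pass or Fail. The function name wait_for_verdict, the POLL_SECONDS variable, and the default of 30 tries are my own assumptions. It treats any other value, or a result that does not exist yet, as still pending.

```shell
#!/bin/sh
# Hypothetical helper: poll a ChaosResult until its verdict is Pass or Fail.
# $1 = chaosresult name, $2 = namespace, $3 = max tries (assumed default: 30)
# POLL_SECONDS (assumed default: 5) sets the delay between tries.
# Returns 0 on Pass, 1 on Fail, 2 if no final verdict arrived in time.
wait_for_verdict() {
  result="$1"; ns="$2"; tries="${3:-30}"
  while [ "$tries" -gt 0 ]; do
    verdict=$(kubectl get chaosresult "$result" --namespace "$ns" \
      -o jsonpath='{.status.experimentStatus.verdict}' 2>/dev/null)
    case "$verdict" in
      Pass) echo "Pass"; return 0 ;;
      Fail) echo "Fail"; return 1 ;;
    esac
    tries=$((tries - 1))
    sleep "${POLL_SECONDS:-5}"
  done
  echo "timed out (last verdict: ${verdict:-none})"
  return 2
}
```

For the run above, wait_for_verdict moreapps-chaos-pod-delete more-apps would print Pass once the experiment completes, and its exit status makes it easy to use in a script.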
Congratulations on your first (and hopefully not last) chaos engineering test! Now you have a powerful tool to use and help your environment grow.
Final thoughts
You might be thinking, "I can't run this manually every time I want to run chaos. How far can I take this, and how can I set it up for the long term?"
Litmus' best part (aside from the Chaos Hub) is its scheduler function. You can use it to define the times and dates when experiments run, on either a repeating or a sporadic schedule. This is a great tool for detail-oriented admins who have been working with Kubernetes for a while and are ready to create some chaos. I suggest staying up to date on Litmus and how to use this tool for regular chaos engineering. Happy pod hunting!