My DevOps journey kicked off when I started developing Datree, an open source command-line tool that helps DevOps engineers prevent Kubernetes misconfigurations from reaching production. One year later, seeking best practices and more ways to prevent misconfigurations became my way of life.
That's why, when I first learned about Argo CD, the thought of using it without knowing its pitfalls and complications simply didn't make sense to me. After all, configuring it incorrectly can easily cause the next production outage.
In this article, I'll explore some of the Argo best practices I've found and show you how to validate custom resources against them.
Disallow providing an empty retryStrategy
Project: Argo Workflows
Best practice: A user can specify a retryStrategy that dictates how errors and failures are retried in a workflow. Providing an empty retryStrategy (retryStrategy: {}) causes a container to retry until completion and, eventually, causes out-of-memory (OOM) issues.
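Here's a minimal sketch of a workflow template that bounds its retries instead of passing an empty object (the limit value and image are illustrative):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-example-
spec:
  entrypoint: main
  templates:
    - name: main
      retryStrategy:
        limit: "3"        # bound the retries instead of retryStrategy: {}
      container:
        image: alpine:3.18
        command: [sh, -c, "exit 1"]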
Ensure that Workflow pods are not configured to use the default service account
Project: Argo Workflows
Best practice: All pods in a workflow run with a service account, which can be specified in workflow.spec.serviceAccountName. If omitted, Argo uses the default service account of the workflow's namespace. This gives the workflow (the pod) the ability to interact with the Kubernetes API server, and it allows attackers with access to a single container to abuse Kubernetes by using the AutomountServiceAccountToken. If the AutomountServiceAccountToken option happens to be disabled, then the default service account Argo uses won't have any permissions, and the workflow fails.
It's recommended to create dedicated user-managed service accounts with the appropriate roles.
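A minimal sketch of a workflow using one (workflow-runner is a hypothetical user-managed service account bound only to the roles the workflow needs):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: dedicated-sa-
spec:
  serviceAccountName: workflow-runner   # hypothetical user-managed service account
  entrypoint: main
  templates:
    - name: main
      container:
        image: alpine:3.18
        command: [echo, hello]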
Set the label 'part-of: argocd' in ConfigMaps
Project: Argo CD
Best practice: When installing Argo CD, its atomic configuration contains a few services and configMaps. For each specific kind of ConfigMap and Secret resource, there is only a single supported resource name. If you need to merge things, do it before creating them. It's important to label your ConfigMap resources with app.kubernetes.io/part-of: argocd; otherwise, Argo CD isn't able to use them.
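For example, here's a minimal sketch of the argocd-cm ConfigMap carrying the required label (the timeout.reconciliation entry is illustrative):

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm                       # the only supported name for this ConfigMap
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd   # without this label, Argo CD ignores the resource
data:
  timeout.reconciliation: 180s          # illustrative setting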
Disable 'FailFast=false' in DAG
Project: Argo Workflows
Best practice: As an alternative to specifying sequences of steps in a Workflow, you can define the workflow as a directed acyclic graph (DAG) by specifying the dependencies of each task. The DAG logic has a built-in fail-fast feature that stops scheduling new steps as soon as it detects that one of the DAG nodes has failed, then waits until all DAG nodes are completed before failing the DAG itself. The FailFast flag defaults to true. If set to false, it allows a DAG to run all branches of the DAG to completion (either success or failure), regardless of the failed outcomes of branches in the DAG.
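Here's a minimal two-task DAG sketch that keeps the default fail-fast behavior (the task names and image are illustrative); the point is to leave failFast at its default of true rather than setting it to false:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: dag-example-
spec:
  entrypoint: diamond
  templates:
    - name: diamond
      dag:
        failFast: true        # the default; avoid setting this to false
        tasks:
          - name: A
            template: step
          - name: B
            dependencies: [A]
            template: step
    - name: step
      container:
        image: alpine:3.18
        command: [echo, done]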
Ensure Rollout pause step has a configured duration
Project: Argo Rollouts
Best practice: For every Rollout, you can define a list of steps. Each step can have one of two fields: setWeight and pause. The setWeight field dictates the percentage of traffic that should be sent to the canary, and pause literally instructs the rollout to pause.
Under the hood, the Argo controller uses these steps to manipulate the ReplicaSets during the rollout. When the controller reaches a pause step for a rollout, it adds a PauseCondition struct to the .status.PauseConditions field. If the duration field within the pause struct is set, the rollout does not progress to the next step until it has waited for the value of the duration field. However, if the duration field has been omitted, the rollout might wait indefinitely until the added pause condition is removed.
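Here's a sketch of the relevant canary strategy section with every pause bounded (the weights and the 60-second duration are illustrative):

strategy:
  canary:
    steps:
      - setWeight: 20
      - pause:
          duration: 60s   # bound the pause; omitting duration can block the rollout indefinitely
      - setWeight: 100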
Specify Rollout's revisionHistoryLimit
Project: Argo Rollouts
Best practice: The .spec.revisionHistoryLimit is an optional field that indicates the number of old ReplicaSets that should be retained to allow rollback. These old ReplicaSets consume resources in etcd and crowd the output of kubectl get rs. The configuration of each Deployment revision is stored in its ReplicaSets; therefore, once an old ReplicaSet is deleted, you lose the ability to roll back to that revision of the Deployment.
By default, 10 old ReplicaSets are kept, but the ideal value depends on the frequency and stability of new Deployments. More specifically, setting this field to zero means that all old ReplicaSets with 0 replicas are removed. In this case, a new Deployment rollout cannot be undone, because its revision history has been removed.
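A sketch of the field on a Rollout (the name and the value of 3 are illustrative; pick a value that matches how far back you realistically roll back):

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: history-example       # hypothetical name
spec:
  revisionHistoryLimit: 3     # keep only as many old ReplicaSets as you'd roll back to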
Set scaleDownDelaySeconds to 30s
Project: Argo Rollouts
Best practice: When the rollout changes the selector on a service, there's a propagation delay before all the nodes update their IP tables to send traffic to the new pods instead of the old ones. During this delay, traffic is directed to the old pods if the nodes have not yet been updated. In order to prevent packets from being sent to a node that killed the old pod, the rollout uses the scaleDownDelaySeconds field to give nodes enough time to broadcast the IP table changes. If omitted, the Rollout waits 30 seconds before scaling down the previous ReplicaSet.
It's recommended to set scaleDownDelaySeconds to a minimum of 30 seconds to ensure that the IP table changes propagate across the nodes in the cluster. The reason is that Kubernetes waits for a specified time called the termination grace period, which is 30 seconds by default.
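A sketch of a blue-green strategy section with the delay set explicitly (the service names are hypothetical):

strategy:
  blueGreen:
    activeService: my-app-active     # hypothetical service name
    previewService: my-app-preview   # hypothetical service name
    scaleDownDelaySeconds: 30        # give nodes time to propagate the selector change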
Ensure retry on both Error and TransientError
Project: Argo Workflows
Best practice: retryStrategy is an optional field of the Workflow CRD that provides controls for retrying a workflow step. One of the fields of retryStrategy is retryPolicy, which defines the policy of NodePhase statuses to be retried (NodePhase is the condition of a node at the current time). The options for retryPolicy are Always, OnError, or OnTransientError. In addition, the user can use an expression for finer control over retries.
What's the catch?
- retryPolicy=Always is too much: You want to retry on system-level errors (for instance, the node dying or being preempted), but not on errors occurring in user-level code, since those failures indicate a bug. In addition, this option is more suitable for long-running containers than for workflows, which are jobs.
- retryPolicy=OnError doesn't handle preemptions: Using retryPolicy=OnError handles some system-level errors, like the node disappearing or the pod being deleted. However, during graceful Pod termination, the kubelet assigns a Failed status and a Shutdown reason to the terminated Pods. As a result, node preemptions result in the node status Failure instead of Error, so preemptions aren't retried.
- retryPolicy=OnError doesn't handle transient errors: Classifying a preemption failure message as a transient error is allowed, but this requires retryPolicy=OnTransientError (see also TRANSIENT_ERROR_PATTERN).
I recommend setting retryPolicy: "Always" and using the following expression:
lastRetry.status == "Error" or (lastRetry.status == "Failed" and asInt(lastRetry.exitCode) not in [0])
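Wired into a workflow template's retryStrategy, that looks roughly like this (the limit is an illustrative cap on retries):

retryStrategy:
  retryPolicy: "Always"
  limit: "3"               # illustrative cap on retries
  expression: >-
    lastRetry.status == "Error" or
    (lastRetry.status == "Failed" and asInt(lastRetry.exitCode) not in [0])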
Ensure progressDeadlineAbort set to true
Project: Argo Rollouts
Best practice: A user can set progressDeadlineSeconds, which states the maximum time in seconds a rollout has to make progress during an update before it is considered failed.
If rollout pods get stuck in an error state (for example, an image pull back-off), the rollout degrades after the progress deadline is exceeded, but the bad ReplicaSet or pods aren't scaled down. The pods keep retrying, and eventually the rollout message reads ProgressDeadlineExceeded: The replicaset has timed out progressing. To abort the rollout, set both progressDeadlineSeconds and progressDeadlineAbort, with progressDeadlineAbort: true.
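A sketch of the two fields together on a Rollout spec (the 600-second deadline is illustrative):

spec:
  progressDeadlineSeconds: 600   # illustrative deadline for making progress
  progressDeadlineAbort: true    # abort the update once the deadline is exceeded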
Ensure custom resources match the namespace of the ArgoCD instance
Project: Argo CD
Best practice: In each repository, all Application and AppProject manifests should use the same metadata.namespace. If you deployed Argo CD using the typical deployment, Argo CD creates two ClusterRoles and ClusterRoleBindings that reference the argocd namespace by default. In this case, it's recommended not only to ensure that all Argo CD resources match the namespace of the Argo CD instance, but also to use the argocd namespace itself. Otherwise, you need to make sure to update the namespace reference in all Argo CD internal resources.
However, if you deployed Argo CD for external clusters (in Namespace Isolation Mode), then instead of ClusterRoles and ClusterRoleBindings, Argo creates Roles and associated RoleBindings in the namespace where Argo CD was deployed. The created service account is granted only a limited level of access to manage resources, so for Argo CD to function as desired, access to the namespace must be explicitly granted. In this case, you should make sure all resources, including Application and AppProject, use the correct namespace of the Argo CD instance.
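For example, here's a sketch of an Application that matches a typical argocd-namespaced instance (the application name and repoURL are hypothetical):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: guestbook                # hypothetical application name
  namespace: argocd              # must match the namespace of the Argo CD instance
spec:
  project: default
  source:
    repoURL: https://github.com/example/gitops-repo.git   # hypothetical repository
    path: guestbook
    targetRevision: HEAD
  destination:
    server: https://kubernetes.default.svc
    namespace: guestbook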
Now What?
I'm a GitOps believer, and I think every Kubernetes resource should be handled exactly like your source code, especially if you're using Helm or Kustomize. So, the way I see it, you should automatically check your resources on every code change.
You can write your policies in languages like Rego or JSONSchema and use tools like OPA Conftest or other validators to scan and validate your resources on every change. Additionally, if you have one GitOps repository, then Argo plays a great role in providing a centralized repository for you to develop and version-control your policies.
How Datree works
The Datree CLI runs automatic checks on every resource that exists in a given path. After the check is complete, Datree displays a detailed output of any violation or misconfiguration it finds, with guidelines on how to fix it.
Scan your cluster with Datree
$ kubectl datree test -- -n argocd
You can use the Datree kubectl plugin to validate your resources after deployments, prepare for future version upgrades, and monitor the overall compliance of your cluster.
Scan your manifests in the CI
In general, Datree can be used in the CI, as a local testing library, or even as a pre-commit hook. To use datree, you first need to install the command on your machine and then execute it with the following command:
$ datree test ~/.datree/k8s-demo.yaml
>> File: .datree/k8s-demo.yaml
[V] YAML Validation
[V] Kubernetes schema validation
[X] Policy check
X Ensure each container image has a pinned (tag) version [1 occurrence]
- metadata.name: rss-site (kind: Deployment)
!! Incorrect value for key 'image' - specify an image version
X Ensure each container has a configured memory limit [1 occurrence]
- metadata.name: rss-site (kind: Deployment)
!! Missing property object 'limits.memory' - value should be within the accepted boundaries
X Ensure workload has valid Label values [1 occurrence]
- metadata.name: rss-site (kind: Deployment)
!! Incorrect value for key(s) under 'labels' - the values syntax is not valid
X Ensure each container has a configured liveness probe [1 occurrence]
- metadata.name: rss-site (kind: Deployment)
!! Missing property object 'livenessProbe' - add a properly configured livenessProbe
[...]
As I mentioned above, the CLI runs automatic checks on every resource that exists in the given path. Each automatic check includes three steps:
- YAML validation: Verifies that the file is a valid YAML file.
- Kubernetes schema validation: Verifies that the file is a valid Kubernetes/Argo resource.
- Policy check: Verifies that the file is compliant with your Kubernetes policy (Datree built-in rules by default).
Summary
In my opinion, governing policies are only the beginning of achieving reliability, security, and stability for your Kubernetes cluster. I was surprised to find that centralized policy management might also be a key solution for resolving the DevOps and Development deadlock once and for all.
Check out the Datree open source project. I highly encourage you to review the code and submit a PR, and don't hesitate to reach out.
This article originally appeared on the Datree blog and has been republished with permission.