Ansible is an automation tool for configuration management and orchestration. One important thing many systems administrators and engineers need to understand when using Ansible is how to tune it for speed and performance. Here are some tips and tricks to keep in mind.
Use SSH multiplexing
At a high level, there are a couple of things to worry about with SSH: the time it takes to establish a new connection to a host and the amount of latency experienced while using that connection. Each task Ansible performs will open a new SSH connection to each target host. If there is some delay in establishing that connection, that delay is compounded by the number of tasks in your playbook. To help solve this problem, an SSH control socket to a host can be used for multiplexing: once one connection is established (and held open), subsequent connections can reuse the socket, significantly decreasing the time it takes to establish them.
ControlPersist is an SSH feature that allows an established control socket to stick around for a period of time after its last use. This lets you benefit from the control socket without having to explicitly hold a long-lived connection open to use it. With the control socket and ControlPersist enabled, the first connection to a host establishes the socket, which then persists for a period of time.
The persist time should be long enough that later tasks in a playbook can still reuse the socket, but not so long that stale sockets pile up. Each persistent connection consumes memory, and that cost is multiplied by the number of hosts in your inventory.
You can change these settings in the [ssh_connection] section of the global Ansible configuration file (/etc/ansible/ansible.cfg), or in your own project-level ansible.cfg file instead.
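As a minimal sketch, the relevant settings live in the [ssh_connection] section; the ten-minute persist time below is an example value you should tune to your playbook length:

```ini
# ansible.cfg -- example values, not a recommendation for every environment
[ssh_connection]
# ControlMaster=auto lets the first connection create the control socket;
# ControlPersist keeps the socket alive for 10 minutes after its last use
ssh_args = -o ControlMaster=auto -o ControlPersist=10m
```

With this in place, the first task against a host pays the full connection cost, and every subsequent task within the persist window reuses the socket.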
Enable SSH pipelining
SSH pipelining is different but related; it's more of an Ansible feature than an SSH feature. By default, when Ansible uses SSH and SSH-like connection plugins, it will connect to the target host multiple times for each task. The details don't matter much, but recall that if each connection carries overhead, multiple connections per task compound the delays even further.
The pipelining feature, when enabled, will instead establish a single connection to the target and, through that connection, remotely execute the module. Again, the details don't matter much, you just need to know that many connections are reduced to a single connection. It is not enabled by default, mainly because it requires extra configuration on the target host in order to make use of sudo (requiretty must be disabled).
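A minimal sketch of enabling pipelining in ansible.cfg, with the sudoers change it depends on noted as a comment:

```ini
# ansible.cfg
[ssh_connection]
pipelining = True

# On each target host, sudo's requiretty option must be disabled for
# pipelining to work, e.g. with a line like this in /etc/sudoers:
#   Defaults !requiretty
```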
One of the best things about Ansible is its ability to operate in parallel across multiple hosts. The number of hosts it can operate on at once depends on multiple factors. The largest factor is the forks parameter. This parameter has a default of 5, which limits Ansible to operating on only five hosts at one time. A second factor is the number of target hosts in a play. If the play targets four hosts, then it doesn't matter whether forks is set at 5, 50, or 500; Ansible will still only target the four hosts. Thankfully, you don't have to worry much about fork overhead or about matching the fork count exactly to the host count: Ansible will spawn only as many forks as the target hosts require. Setting forks to 500 with a target of four hosts results in four forks.
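The fork count can be raised either in the configuration file or per run; 50 here is just an example value:

```ini
# ansible.cfg
[defaults]
forks = 50
```

The same thing can be done for a single run with `ansible-playbook -f 50 site.yml` (the playbook name is hypothetical).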
Another large factor related to target hosts is how the play is set up. If the play is set to use "serial" mode, you can configure a batch size. This is the maximum number of hosts that will be batched together and run through the play. (Note that all the hosts in the batch run the play before the next batch runs.) The batch size could be set much smaller than the fork size, and the smallest number wins for parallelism.
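A sketch of a play using serial batching; the `webservers` group, service name, and batch size are example values:

```yaml
- hosts: webservers
  serial: 10            # at most 10 hosts run through the play at a time,
  tasks:                # even if forks is set higher
    - name: Restart the application
      service:
        name: myapp     # hypothetical service name
        state: restarted
```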
When managing large-scale fleets, it might become a requirement to configure Ansible to sustain more than 100 forks. You might run into performance and capacity problems, such as memory and CPU limits. One thing to think about when scaling up to larger and larger numbers is the negative impact on your downstream systems, because asking that number of hosts to accomplish a task at once can prove problematic. For example, asking 300 to 500 systems to fetch a file, even from a single source, may exhaust that resource and cause failures.
Do you need fact gathering?
Fact gathering is essentially an implicit task. When it is turned on (the default), at the start of each play every host gets a task to gather facts from it. These facts become hostvars. This is useful if you need the information, but it does take time. I suggest you turn off fact gathering unless you depend on those facts, because gathering them is resource-intensive and slow: three seconds or more per batch of hosts. For example, if you have 100 hosts and the default of five forks, that's 20 batches at roughly three seconds each, or one minute, spent just gathering facts. This is wasted time if you're not using facts. You can set your playbook to skip fact gathering by adding the line gather_facts: no.
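A minimal sketch of a play with fact gathering disabled; the file path is an example. Note that any task referencing ansible_* facts would fail with gathering turned off:

```yaml
- hosts: all
  gather_facts: no             # skip the implicit fact-gathering task
  tasks:
    - name: Create a file
      file:
        path: /tmp/example.txt # hypothetical path
        state: touch
```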
Before gather_facts: no was added, the playbook took 5.39 seconds with an additional 0.24 seconds to execute the task of creating the file. After adding the line gather_facts: no, the playbook completed in 0.24 seconds.
Concurrent tasks with async
The async task parameter is interesting. It causes Ansible to close the connection once the task is running, then re-establish a connection at a set interval to see whether the task has completed. This can be useful for getting a large fleet working on a task as quickly as possible. However, it can also increase the number of connections, since Ansible connects back repeatedly to check on status. If the polling interval is long, hosts can finish and then sit idle, waiting for the interval timer to elapse before Ansible checks back in. I should also mention async's fire-and-forget and check-later feature in Ansible. It can be useful for a long-running background task that you eventually want to check on. (Note: This feature requires Ansible 1.8 or later.)
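A sketch of both async styles; the script path and timing values are example assumptions:

```yaml
# Poll style: Ansible disconnects, then reconnects every 30 seconds to check
- name: Long-running job, polled
  command: /usr/local/bin/long_job.sh   # hypothetical script
  async: 3600   # allow the task up to an hour to finish
  poll: 30      # reconnect every 30 seconds to check status

# Fire-and-forget: start the job, move on, and check it later
- name: Long-running job, fire and forget
  command: /usr/local/bin/long_job.sh
  async: 3600
  poll: 0       # do not wait at all
  register: long_job

- name: Check on the job later
  async_status:
    jid: "{{ long_job.ansible_job_id }}"
  register: job_result
  until: job_result.finished
  retries: 30
  delay: 10
```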
These changes can be made in /etc/ansible/ansible.cfg, though you can also set them inside the actual playbook.
Use pull mode to check for changes
Pull mode is another strategy to increase efficiency. As I wrote above, one of the limitations I experienced was my Ansible control host's ability to manage more than 500 forks. Pull mode (via the ansible-pull command) is a way to spread the processing requirements across the fleet. This only works if you do not need central coordination of when the tasks are completed. If your use case can tolerate eventual consistency, you can publish playbooks in a Git repo and have Ansible on each host periodically check for changes to the playbooks. When changes are found, the host executes the playbook.
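As a sketch, a cron entry on each managed host could periodically run ansible-pull against the published repository; the repository URL, schedule, log path, and playbook name below are all hypothetical:

```shell
# /etc/crontab entry: every 15 minutes, pull the repo and run local.yml
*/15 * * * * root ansible-pull -U https://example.com/ops/playbooks.git local.yml >> /var/log/ansible-pull.log 2>&1
```

Because ansible-pull only applies changes when it finds them, each host converges on the published configuration without the control host coordinating anything.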
I hope you now have a greater understanding of speed and efficiency when it comes to Ansible. There are multiple ways to improve Ansible's performance, and you should explore them based on your needs and current configuration or setup.