How to use Ansible to set up system monitoring with Prometheus

In the third part of this Ansible how-to series, learn how to automate system monitoring.

In summer 2017, I wrote two how-to articles about using Ansible. After the first article, I planned to show examples of the copy, systemd, service, apt, yum, virt, and user modules. But I decided to tighten the scope of the second part to focus on using the yum and user modules. I explained how to set up a basic Git SSH server and explored the command, file, authorized_key, yum, and user modules. In this third article, I'll go into using Ansible for system monitoring with the Prometheus open source monitoring solution.

If you followed along with those first two articles, you should have:

Installed an Ansible control host
Created an SSH key on the Ansible control host
Propagated the SSH key to all the machines you want Ansible to manage
Restricted SSH access on all machines
Installed a Git SSH server
Created the git user, which is used to check code in and out of the Git SSH server

From a business perspective, you have now:

Simplified host management
Produced an auditable, repeatable, automated way to manage those hosts
Started to create a path for disaster recovery (via Ansible playbooks)

To build the skills that businesses need, you have to be able to see resource utilization trends over time. Ultimately, this means setting up some sort of monitoring tool. There are plenty to choose from, including Zabbix, Zenoss, Nagios, Prometheus, and many others. I have worked with all of these; which solution you choose is largely a function of:

Budget
Time
Familiarity

Agentless monitoring tools such as Nagios may use something like SNMP to monitor hosts and expose metrics. There can be many upsides to this approach (such as not having to install an agent). However, I have found that Prometheus, while agent-based, is very easy to set up and provides far more metrics out of the box, so that's what I'll use in this article.

Setting up Prometheus

Introduction to roles

Unfortunately, there isn't a Linux package manager repository available for Prometheus (outside of the Arch User Repository), or at least none are listed on the Prometheus download page. There is a Docker image available, which may be desirable in some cases, but it requires running extra services if Docker is not already present on the target machine. For this article, I will be deploying the pre-compiled binaries to each host. There are really only two files needed for this: the binary itself and a systemd or upstart init file.

Because of this, a single Prometheus installation playbook can be quite involved; therefore it would be a good time to discuss transitioning to an Ansible Role. To put it simply, while you can have one giant YAML file, roles are a way to have a collection of smaller tasks that can be included into a larger play. This is more easily explained through examples. For instance, say you have a user ID that needs to be on every host, yet some of the servers you manage require a web server, while others may have a game server on them. You might have two different playbooks to handle this. Consider the following playbook:

Example 1: create_user_with_ssh_key.yaml

- hosts: "{{ hostname }}"
  gather_facts: false
  tasks:
    - name: create and/or change {{ username }}'s password
      user:
        name: "{{ username }}"
        password: << password hash >>
    - name: copy ssh keys
      authorized_key:
        key: "{{ item }}"
        user: "{{ username }}"
        state: present
        exclusive: False
      with_file: 
        - ../files/user1_ssh_key.pub
        - ../files/user2_ssh_key.pub

There are a few options available when considering this problem.

1. Copy this code into the start of each playbook that will be used to create the different servers
2. Run this playbook manually, before running the server configuration playbook
3. Turn create_user_with_ssh_key.yaml into a task file, which can then be included in a role using standard Ansible practice

Option 1 is not manageable at scale. Suppose you had to change the password or the username you were creating. You would have to find all of the playbooks that include this code.

Option 2 is a step in the right direction. However, it requires an additional, manual step every time you create a server. In a home lab, this may be sufficient. However, in a diverse environment with the potential for several people to be following the same process to create servers, option 2 relies on the administrator to document and correctly follow all the steps required to produce a functioning server to exact specifications.

To make up for those shortcomings, option 3 uses Ansible's built-in solution. It has the advantage of using an easily reproducible server-build process. Also, when auditing the build process (you are using the source control we set up earlier, right?), the auditor can potentially open a single file to determine what task files were automatically used by Ansible to produce a server build. In the end, this will be the best long-term approach, and it is a good idea to learn how to use roles and get into the habit of using them early and often.

Organizing your roles with proper directory structure is critical to both easy auditability and your own success. The Ansible documentation has some suggestions regarding directory structure and layout. I prefer a directory layout similar to this:

└── production
    ├── playbooks
    └── roles
        ├── common
        │   ├── defaults
        │   ├── files
        │   ├── handlers
        │   ├── tasks
        │   └── vars
        ├── git_server
        │   ├── files
        │   ├── handlers
        │   ├── tasks
        │   └── vars
        ├── monitoring
        │   ├── files
        │   ├── handlers
        │   ├── tasks
        │   ├── templates
        │   └── vars

The Ansible system for designing roles can be a little confusing at first, especially since there are multiple places where variables can be defined. In the situation above, I could create a group_vars directory in the production folder like so:

└── production/
    └── group_vars/
        └── all/
            └── vars.yaml

Placing variables in this file will make them accessible to any role that is put in the production folder. You could have vars under each role (such as git_server) and thus have them available to all tasks for a given role:

└── environments/
    └── production/
        └── roles/
            └── git_server/
                └── vars/
                    └── main.yaml

Finally, you can specify variables in the play itself. These can be scoped locally to the specific task or to the play itself (thus spanning multiple tasks in the same play).
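To make the play-level and task-level scopes concrete, here is a minimal sketch. The playbook itself and the value deploy are hypothetical and exist only for illustration; the git host group and the username variable are borrowed from the examples in this article.

```yaml
# Hypothetical playbook showing play scope versus task scope.
- hosts: git
  gather_facts: false
  vars:
    # Play scope: visible to every task in this play.
    username: git
  tasks:
    - name: show the play-level value
      debug:
        msg: "username is {{ username }}"

    - name: show a task-level override
      vars:
        # Task scope: applies only to this one task.
        username: deploy
      debug:
        msg: "username is {{ username }}"
```

In this sketch the first task should report git and the second deploy, because a variable set on a task takes precedence over one set on the play.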

To recap, you can declare variables:

At the role level for all tasks in a given role
At the playbook level for all tasks in a play
Inside individual tasks
Inside the Ansible hosts file (a.k.a. inventory); this is used mainly for machine variables and not covered in this discussion

Deciding at which scope to define variables can be tough, especially when balanced against ease of maintainability. You could put all your variables at the global level, which makes them easy to find but may not be the best idea for large environments. The reverse of this is to place all variables inside individual tasks, but this can become a real headache if you have a lot of variables. It is worth considering the trade-offs in your specific situation.

Going back to the small playbook in Example 1 above, we might break out our files like this:

├── production
│   ├── playbooks
│   └── roles
│       ├── common
│       │   ├── files
│       │   │   ├── user1_ssh_key.pub
│       │   │   └── user2_ssh_key.pub
│       │   ├── tasks
│       │   │   ├── create_user.yaml
│       │   │   ├── copy_ssh_key.yaml

The contents of the tasks files are identical to the lines in the single, monolithic playbook:

Example 2: create_user.yaml

- name: create and/or change {{ username }}'s password
  user:
    name: "{{ username }}"
    password: << password hash >>

Example 3: copy_ssh_key.yaml

- name: copy ssh keys
  authorized_key:
    key: "{{ item }}"
    user: "{{ username }}"
    state: present
    exclusive: False
  with_file: 
    - user1_ssh_key.pub
    - user2_ssh_key.pub

However, what has changed (potentially) is the way in which variables are passed into Ansible. You can still use the --extra-vars option. However, to demonstrate another approach, we will use the vars/main.yaml file. The vars/main.yaml file has the following content:

username: 'git'
password: '$6$cmYTU26zdqOArk5I$WChA039bHQ8IXAo0W8GJxhk8bd9wvcY.DTUwN562raYjFhCkJSzSBm6u8RIgkaU8b3.Z3EmyxyvEZt8.OpCCN0'

The password should be a hash and not a cleartext password. To generate a SHA-512 hash on most versions of Linux, you can run the following Python command (the crypt module was removed in Python 3.13, so this requires Python 3.12 or earlier):

python3 -c 'import crypt,getpass; print(crypt.crypt(getpass.getpass(), "$6$awerwass"))'

In the above command, the $6$ prefix selects SHA-512 and the salt is the string awerwass. These are just random characters I pounded on the keyboard. DO NOT USE THE SAME SALT IN PRODUCTION.
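As an alternative to hashing by hand, Ansible's password_hash filter can produce the hash inside the task. This is only a sketch: cleartext_password is a hypothetical variable that you would have to supply yourself (for example, from an encrypted vars file), and depending on the control host the filter may need the passlib Python library.

```yaml
# Sketch: hash the password at run time with the password_hash filter.
- name: create and/or change the user's password
  user:
    name: "{{ username }}"
    # cleartext_password is a hypothetical variable, not defined in this article.
    password: "{{ cleartext_password | password_hash('sha512') }}"
```

Without a fixed salt, the filter generates a new salt on every run, so the task will report a change each time. Passing a salt as the second argument to password_hash avoids that.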

To have these tasks run together, you need to create a main.yaml in the tasks directory. It should have the following content:

---
- import_tasks: create_user.yaml
- import_tasks: copy_ssh_key.yaml

Finally, create a playbook with the following content:

- hosts: git
  gather_facts: false
  roles:
    - role: ../roles/common

Your directory structure should look like this:

├── production
│   ├── playbooks
│   │   ├── common.yaml
│   └── roles
│       ├── common
│       │   ├── files
│       │   │   ├── user1_ssh_key.pub
│       │   │   └── user2_ssh_key.pub
│       │   ├── handlers
│       │   │   └── main.yaml
│       │   ├── tasks
│       │   │   ├── copy_ssh_key.yaml
│       │   │   ├── create_user.yaml
│       │   │   ├── main.yaml
│       │   └── vars
│       │       └── main.yaml

Setting up a role for Prometheus

Now that we have covered the basics of creating a role, let's focus on creating a Prometheus role. As mentioned previously, only two files are required for each agent to run: a service (or upstart) file, and the Prometheus binary. Below are examples of each:

Example 4: systemd prometheus-node-exporter.service file

[Unit]
Description=Prometheus Exporter for machine metrics.
After=network.target

[Service]
ExecStart=/usr/bin/prometheus_node_exporter

[Install]
WantedBy=multi-user.target

Example 5: upstart init file

# Run prometheus_node_exporter

start on startup

script
/usr/bin/prometheus_node_exporter
end script

Example 6: systemd prometheus.service (the server service) file

[Service]
User=prometheus
Group=prometheus
ExecStart=/usr/bin/prometheus -config.file=/etc/prometheus/prometheus.yaml -storage.local.path=/var/lib/prometheus/data -storage.local.retention=8928h -storage.local.series-file-shrink-ratio=0.3
ExecReload=/bin/kill -HUP $MAINPID

[Install]
WantedBy=multi-user.target

In my environment, with Ubuntu machines (with and without systemd) and a large number of Red Hat and Arch machines, I needed to write a playbook to distribute the correct startup scripts to the respective boxes. There are several ways you could determine whether to deploy the upstart or systemd service files. Ansible has a built-in fact called ansible_service_mgr that can be used to suss out the appropriate service manager.
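For comparison, a minimal sketch of the ansible_service_mgr approach might look like the following. It assumes fact gathering is enabled for the play and reuses the file names and destinations from this article; the task layout is illustrative, not the one used in the rest of this walkthrough.

```yaml
# Sketch: pick the startup file from the built-in ansible_service_mgr fact.
- name: put systemd service file in place
  copy:
    src: prometheus-node-exporter.service
    dest: /etc/systemd/system/prometheus-node-exporter.service
  when: ansible_service_mgr == 'systemd'

- name: copy the upstart conf
  copy:
    src: prometheus-node-exporter.conf
    dest: /etc/init/prometheus-node-exporter.conf
  when: ansible_service_mgr == 'upstart'
```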

However, I decided to demonstrate how to use scripts to provide Ansible with facts during the Gather Facts stage. This is known as Ansible Local Facts. These facts are read from the /etc/ansible/facts.d/ directory. Files in this directory can be JSON, INI, or executable files returning JSON. They also need to have the file extension .fact. The simple Bash script I wrote checks for the systemd PID and returns JSON with the value true if it is found and false otherwise, as seen in Example 7:

Example 7: systemd_check.fact

#!/bin/bash
# Check for systemd; print { "systemd": "true" } if it is running, "false" otherwise.

systemd_pid=$(pidof systemd)
if [ -z "$systemd_pid" ]; then
  echo '{ "systemd": "false" }'
else
  echo '{ "systemd": "true" }'
fi

With this in mind, we can begin to build a simple task to help deploy the Prometheus agent. To accomplish this, the local facts file needs to be copied to each server, the binary and startup scripts need to be deployed, and the service must be restarted. Below is a task that will deploy the systemd_check.fact script.

Example 8: copy_local_facts.yaml

- name: Create the facts directory if it does not exist
  file:
    path: /etc/ansible/facts.d
    state: directory

- name: Copy the systemd facts file
  copy:
    src: systemd_check.fact
    dest: /etc/ansible/facts.d/systemd_check.fact
    mode: 0755
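One caveat: facts are collected at the start of a play, so a fact script copied during the play is not visible to later tasks in that same run. A minimal sketch of one way to handle this, assuming you add it right after the copy task, is to call the setup module again:

```yaml
# Sketch: re-read local facts so the newly copied script is picked up in this run.
- name: Refresh local facts
  setup:
    filter: ansible_local
```

The filter option limits the returned facts to the ansible_local subtree.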

Now that our custom facts have been deployed, we can deploy the binaries needed. But first, let's take a look at the variable file that will be used for these tasks. In this example, I have elected to use the vars/ directory that is localized to the individual role. It currently looks like this:

Example 9: vars/main.yaml

exporter_binary: 'prometheus_node_exporter'
exporter_binary_dest: '/usr/bin/prometheus_node_exporter'
exporter_service: 'prometheus-node-exporter.service'
exporter_service_dest: '/etc/systemd/system/prometheus-node-exporter.service'
exporter_upstart: 'prometheus-node-exporter.conf'
exporter_upstart_dest: '/etc/init/prometheus-node-exporter.conf'

server_binary: 'prometheus'
server_binary_dest: '/usr/bin/prometheus'
server_service: 'prometheus.service'
server_service_dest: '/etc/systemd/system/prometheus.service'

prometheus_user: 'prometheus'
prometheus_server_name: 'prometheus'

client_information_dict:
    'conan': '192.168.195.124:9100'
    'confluence': '192.168.195.170:9100'
    'smokeping': '192.168.195.120:9100'
    '7-repo': '192.168.195.157:9100'
    'server': '192.168.195.9:9100'
    'ark': '192.168.195.2:9100'
    'kids-tv': '192.168.195.213:9100'
    'media-centre': '192.168.195.15:9100'
    'nas': '192.168.195.195:9100'
    'nextcloud': '192.168.199.117:9100'
    'git': '192.168.195.126:9100'
    'nuc': '192.168.195.90:9100'
    'desktop': '192.168.195.18:9100'

For now, you can ignore the client_information_dict; that will come into play later.

Example 10: tasks/setup_prometheus_node.yaml

---
- name: copy the binary to {{ exporter_binary_dest }}
  copy:
    src: "{{ exporter_binary }}"
    dest: "{{ exporter_binary_dest }}"
    mode: 0755

- name: put systemd service file in place
  copy:
    src: "{{ exporter_service }}"
    dest: "{{ exporter_service_dest }}"
  when: 
    - ansible_local.systemd_check.systemd == 'true'

- name: copy the upstart conf to {{ exporter_upstart_dest }}
  copy:
    src: "{{ exporter_upstart }}"
    dest: "{{ exporter_upstart_dest }}"
  when:
    - ansible_local.systemd_check.systemd == 'false'

- name: update systemd and restart exporter systemd
  systemd:
    daemon-reload: true
    enabled: true
    state: restarted
    name: "{{ exporter_service }}"
  when: 
    - ansible_local.systemd_check.systemd == 'true'
   
- name: start exporter sysv service
  service:
    name: "{{ exporter_service }}"
    enabled: true
    state: restarted
  when:  
    - ansible_local.systemd_check.systemd == 'false'

The most important thing to note in the task above is that it is referencing ansible_local.systemd_check.systemd. This breaks down into the following naming convention: <how Ansible generates the fact> . <the name of the fact file> . <the key inside the fact to retrieve>. The Bash script systemd_check.fact is run during the Gather Facts stage, and its output is stored in the ansible_local section of the gathered facts. To make a decision based on this fact, I check whether it is true or false. The when clause tells Ansible to execute that specific task only if certain conditions are met. The rest of this task should be fairly straightforward. It uses both the systemd and the service modules to ensure that the appropriate service manager is configured to start the prometheus_node_exporter.
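To see how the pieces of that name line up, here is a small sketch. The comments show the shape of the data as I would expect it after fact gathering, and the debug task is a hypothetical addition for inspecting the value; it is not part of the role.

```yaml
# Expected shape of the fact after gathering:
#   ansible_local:
#     systemd_check:       # from the file name systemd_check.fact
#       systemd: "true"    # from the JSON key printed by the script

- name: Show the value that the when clauses test
  debug:
    var: ansible_local.systemd_check.systemd
```

Because the script prints the value as a JSON string, the conditions compare against the strings 'true' and 'false' rather than against booleans.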

The task to set up the server is very similar:

Example 11: tasks/setup_prometheus_server.yaml

---
- name: copy the server binary to {{ server_binary_dest }}
  copy:
    src: "{{ server_binary }}"
    dest: "{{ server_binary_dest }}"
    mode: 0755
  when:
    - inventory_hostname == 'prometheus'

- name: Ensure that /etc/prometheus exists
  file:
    state: directory
    path: /etc/prometheus
    owner: "{{ prometheus_user }}"
    group: "{{ prometheus_user }}"
    mode: 0755
  when:
    - inventory_hostname == 'prometheus'

- name: place prometheus config
  template:
    src: prometheus.yaml.j2
    dest: /etc/prometheus/prometheus.yaml
  when:
    - inventory_hostname == 'prometheus'

- name: create /var/lib/prometheus/data
  file:
    state: directory
    path: /var/lib/prometheus/data
    recurse: true
    owner: "{{ prometheus_user }}"
    group: "{{ prometheus_user }}"
    mode: 0755
  when:
    - inventory_hostname == 'prometheus'

- name: put systemd service file in place
  copy:
    src: "{{ server_service }}"
    dest: "{{ server_service_dest }}"
  when: 
    - ansible_local.systemd_check.systemd == 'true'
    - inventory_hostname == 'prometheus'

- name: update systemd and restart prometheus server systemd
  systemd:
    daemon-reload: true
    enabled: true
    state: restarted
    name: "{{ server_service }}"
  when: 
    - ansible_local.systemd_check.systemd == 'true'
    - inventory_hostname == 'prometheus'

  notify: restart_prometheus_server

The keen observer will notice a couple of new things in the server task.

The notify: section
The template: module

The notify section is a way to trigger specific types of events when certain criteria are met. Ansible Handlers are used most frequently to trigger service restarts (which is exactly what is happening above). Handlers are stored in a handlers directory within the role. My handler is very basic:

Example 12: handlers/main.yaml

- name: restart_iptables
  service:
    name: iptables
    state: restarted
    enabled: true

- name: restart_prometheus_server
  service:
    name: "{{ server_service }}"
    state: restarted
    enabled: true

This simply allows me to restart the prometheus.service on the Prometheus server.
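A handler runs only when the task that notifies it reports a change, and it runs once at the end of the play even if several tasks notify it. As a sketch of a common arrangement (a variation, not the layout used in my role), the notify can be attached to the task that writes the configuration, so the server restarts only when the rendered file actually changes:

```yaml
# Sketch: restart the server only when the rendered config changes.
- name: place prometheus config
  template:
    src: prometheus.yaml.j2
    dest: /etc/prometheus/prometheus.yaml
  notify: restart_prometheus_server
```

The name given to notify must match the name of a handler in handlers/main.yaml.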

The second point of interest in the setup_prometheus_server.yaml is the template: section. Templating in Ansible offers some very nice advantages. Ansible uses Jinja2 for its templating engine; however, a full explanation of Jinja is outside the scope of this tutorial. Essentially, you can use a Jinja2 template to have one configuration file with variables whose values are computed and substituted during an Ansible play. The Prometheus configuration template looks like this:

Example 13: templates/prometheus.yaml.j2

global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.

  external_labels:
      monitor: 'codelab-monitor'

scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'nodes'
    static_configs:
{% for hostname, ip in client_information_dict.items() %}
      - targets: ['{{ ip }}']
        labels: {'host': '{{ hostname }}' }
{% endfor %}

When the template task runs, Ansible renders the file and writes the result to the dest path, so the .j2 extension does not appear on the remote system. The small for-loop in the template iterates over the client_information_dict, which I defined in my variables file previously. It simply creates a list of virtual machines I want Prometheus to gather metrics on.
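As an illustration, if client_information_dict held only the conan and git entries from my variables file, I would expect the nodes job in the rendered /etc/prometheus/prometheus.yaml to come out roughly like this (the order of the entries may differ):

```yaml
# Sketch of the rendered output for a two-entry client_information_dict.
  - job_name: 'nodes'
    static_configs:
      - targets: ['192.168.195.124:9100']
        labels: {'host': 'conan' }
      - targets: ['192.168.195.126:9100']
        labels: {'host': 'git' }
```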

Note: If you want Prometheus to display hostnames and your DNS is set up correctly, use this instead:

{% for hostname, ip in client_information_dict.items() %}
      - targets: ['{{ hostname }}:9100']
        labels: {'host': '{{ hostname }}' }
{% endfor %}

There are just a few finishing touches left to complete the Prometheus setup. We need to create the prometheus user, (potentially) adjust iptables, tie it all together in the main.yaml, and create a playbook to run.

The setup of the Prometheus user is fairly straightforward, and it will be very familiar if you followed my previous Ansible articles:

Example 14: tasks/create_prometheus_user.yaml

---
- name: Ensure that the prometheus user exists
  user:
    name: "{{ prometheus_user }}"
    shell: /bin/false

The only major difference here is that I am setting the shell to /bin/false so the user can run services but not log in.

If you are running iptables, you will need to make sure to open port 9100 so Prometheus can gather metrics from its clients. Here is a simple task to do that:

Example 15: tasks/prometheus_iptables.yaml

---
- name: Open port 9100
  lineinfile:
    dest: /etc/sysconfig/iptables
    insertbefore: "-A INPUT -j OS_FIREWALL_ALLOW"
    line: "-A INPUT -p tcp -m state --dport 9100 --state NEW -j ACCEPT"
  notify: restart_iptables
  when: 
    - ansible_os_family == "RedHat"

Note: I am only running iptables on my Red Hat family of VMs. If you run iptables on all your VMs, remove the when: section.
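If you would rather change the running ruleset than edit the saved rules file, Ansible's iptables module is another option. This is a sketch under assumptions: it appends the rule to the end of the INPUT chain, which may place it after an existing reject rule, and it does not save the rule to /etc/sysconfig/iptables, so the rule will not survive a reload of the saved rules.

```yaml
# Sketch: open port 9100 in the running ruleset (not persisted to disk).
- name: Open port 9100 in the running ruleset
  iptables:
    chain: INPUT
    protocol: tcp
    destination_port: "9100"
    ctstate: NEW
    jump: ACCEPT
  when:
    - ansible_os_family == "RedHat"
```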

The main.yaml looks like this:

Example 16: tasks/main.yaml

---
- import_tasks: create_prometheus_user.yaml
- import_tasks: setup_prometheus_node.yaml
- import_tasks: setup_prometheus_server.yaml
- import_tasks: prometheus_iptables.yaml

The final piece is to create a playbook that encompasses the roles you need to complete your task:

Example 17: playbooks/monitoring.yaml

- hosts: all
  roles:
    - role: ../roles/common
    - role: ../roles/monitoring

Tying it all together

I know it looks like a lot of text to go through, but the concepts for using Ansible are fairly straightforward. It's usually just a matter of knowing how to accomplish the task you set out to do, and then finding the appropriate Ansible modules to help accomplish it. If you have been following this walkthrough all the way through, you should have a layout similar to this:

├── playbooks
│   ├── git_server.yaml
│   ├── monitoring.yaml
└── roles
    ├── common
    │   ├── files
    │   │   ├── systemd_check.fact
    │   │   ├── user1_ssh_key.pub
    │   │   └── user2_ssh_key.pub
    │   ├── handlers
    │   │   └── main.yaml
    │   ├── tasks
    │   │   ├── copy_systemd_facts.yaml
    │   │   ├── main.yaml
    │   │   ├── push_ssh_key.yaml
    │   │   ├── root_ssh_key_only.yaml
    │   └── vars
    │       └── main.yaml
    ├── monitoring
    │   ├── files
    │   │   ├── prometheus
    │   │   ├── prometheus_node_exporter
    │   │   ├── prometheus-node-exporter.conf
    │   │   ├── prometheus-node-exporter.service
    │   │   ├── prometheus.service
    │   │   └── systemd_check.fact
    │   ├── handlers
    │   │   └── main.yaml
    │   ├── tasks
    │   │   ├── create_prometheus_user.yaml
    │   │   ├── main.yaml
    │   │   ├── prometheus_iptables.yaml
    │   │   ├── setup_prometheus_node.yaml
    │   │   └── setup_prometheus_server.yaml
    │   ├── templates
    │   │   └── prometheus.yaml.j2
    │   └── vars
    │       └── main.yaml

To run your playbook, enter:

[root@ansible-host production]# ansible-playbook -i <path to host file> playbooks/monitoring.yaml

You should now be able to create users, push ssh-keys, check for the existence of systemd, and deploy either the prometheus_node_exporter or the Prometheus server binary to the appropriate servers. Prometheus should initialize with a basic configuration including the list of hosts that were specified in the vars/main.yaml file in the monitoring role.

Congratulations! You've now automated the installation and configuration of a basic Prometheus server and configured your hosts to start transmitting data. As a pleasant side effect, you have also effectively documented all the steps required to accomplish your goals.

Addendum: When I conceived this series, I was going to work through installing Prometheus in OpenShift; however, in reviewing the documentation for the Ansible installer for OpenShift, I found that it is already contained in a playbook and is an option in the installer.

This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.