In this new series, we'll focus on DevOps monitoring and observability tools. Over the next few weeks, we’ll explore metrics aggregation and monitoring, log aggregation, alerting and visualizations, and distributed tracing. Alternatively, you can download the entire open source guide to DevOps monitoring tools now.
Let’s get started.
A tale of two views
Once upon a time, I was troubleshooting some vexing problems in an application that needed to be scaled several orders of magnitude, with only a couple of weeks to re-architect it. We had no log aggregation, no metrics aggregation, no distributed tracing, and no visualization. Most of our work had to be done on the actual production nodes, using tools like strace and grepping through logs. These are great tools, but they don’t make it easy to analyze a distributed system across dozens of hosts. We got the job done, but it was painful and involved a lot more guessing and risk than I prefer.
At a different job, I helped troubleshoot an app in production that was suffering from an out-of-memory (OOM) issue. The problem was inconsistent: it didn’t seem to correlate with running time, load, time of day, or any other aspect that would provide some predictability. This was obviously going to be a difficult problem to diagnose on a system that spanned hundreds of hosts with many applications calling it. Luckily, we had log aggregation, distributed tracing, metrics aggregation, and a plethora of visualizations. We looked at our memory graph and saw a distinct spike in memory usage, so we set up an alert on that spike, which let us diagnose the issue in real time when it occurred.
When we received an alert, we went to our log aggregation system to correlate the logs to the memory spike. We found the OOM error and the related calls around it. We then understood what application was calling the service that resulted in the spike and used that information to find the exact transaction that caused the issue. We determined that someone had stored a huge file in a database that our service was trying to load, but the service was running out of memory before it could fully load and process the record. We should have been defending against this in the first place, but we were happy to find it so quickly and fix it with very little effort. Once we understood the error, we discovered a lot of records had large files like this, and we didn’t need that part of the record to function properly.
You might think the second situation happened a long time after the first and we had improved over time. Or maybe you suspect that when I changed jobs, my new company had better tooling. In reality, the second situation happened before the first one. I moved from a company with fairly advanced observability tools to one with no observability tools. It was deeply unsettling, as a developer, to have an application in production and know nothing about it. I learned a lot about the importance of system observability and the related tools as I began rebuilding that infrastructure. Also, Mike Julian’s Practical Monitoring is a must-read for those who want to know more about their systems.
So, what are observability tools? Actually, what is observability?
Observability isn’t just a marketing term; it’s a concept from control theory. If you want a quick primer, this video might be helpful. Basically, observability means that you can estimate a system’s internal state from its outputs. More generally, the system’s state should be determinable from its outputs. Controllability, the mathematical dual of observability, requires that the system can be driven to a desired state through its inputs.
This is a fairly simple concept, but it’s very challenging to put into practice. In a sufficiently complex system, it may be nearly impossible to implement full observability. However, you should strive to get the right outputs that allow you to determine the system’s state, especially when you encounter a failure mode.
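To make the control-theory definition a little more concrete, here is a minimal sketch for a hypothetical two-state linear system (position and velocity, updated once per time step). It applies the standard rank test: the system is observable when the matrix formed by stacking C and C·A has full rank. The matrices and function names are illustrative assumptions, not something taken from a particular tool.

```python
def observability_matrix_2x2(a, c):
    """Stack C and C*A for a two-state system (a is 2x2, c is a length-2 row)."""
    ca = [
        c[0] * a[0][0] + c[1] * a[1][0],
        c[0] * a[0][1] + c[1] * a[1][1],
    ]
    return [list(c), ca]


def is_observable_2x2(a, c):
    """For two states, full rank of [C; CA] means a nonzero determinant."""
    o = observability_matrix_2x2(a, c)
    det = o[0][0] * o[1][1] - o[0][1] * o[1][0]
    return det != 0


# Hypothetical system: state = (position, velocity); each step adds velocity to position.
A = [[1, 1],
     [0, 1]]

print(is_observable_2x2(A, [1, 0]))  # the output reports position only
print(is_observable_2x2(A, [0, 1]))  # the output reports velocity only
```

Watching position over time lets you work out the velocity, so the first system is observable. Watching only velocity never reveals where the system started, so the second is not. Choosing which outputs an application emits is the same kind of decision.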
Observability tool types
In this series, we’ll dig into different types of observability tools. For each type, we’ll cover what it’s used for, what specific tools are available, some use cases, and any unique characteristics that may come up during your search for a new tool. The types are presented in the order you should implement them. Metrics aggregation is first, as it’s often easy to instrument an application built with any modern language. Second is logging because it will require more application modifications but provides tremendous value. Third is alerting and visualizations, which require the first two types for full functionality. And last is distributed tracing, as it may not be necessary in a simple monolith and is much harder to implement fully.
Metrics aggregation tools generally work with time-series data. Time-series data is time-ordered data, and it is normally collected at a consistent interval. This consistency allows some advanced calculations to be applied to the series and supports predictive analytics using simple regressions or more advanced algorithms.
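The consistent interval is what makes a simple regression practical: the sample index can stand in for the timestamp. As a rough sketch, assuming a hypothetical disk-usage metric sampled once a minute (the data and function names are made up for illustration), a least-squares line can be fitted and extended forward:

```python
def fit_line(values):
    """Least-squares line through samples taken at a fixed interval.

    Because the interval is constant, the index 0, 1, 2, ... is used as the
    x value. Returns (slope per interval, intercept).
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    var = sum((i - mean_x) ** 2 for i in range(n))
    slope = cov / var
    return slope, mean_y - slope * mean_x


def forecast(values, steps_ahead):
    """Extend the fitted line `steps_ahead` intervals past the last sample."""
    slope, intercept = fit_line(values)
    return intercept + slope * (len(values) - 1 + steps_ahead)


# Hypothetical disk usage in percent, one sample per minute.
disk_used = [40.0, 42.0, 44.0, 46.0, 48.0]
print(forecast(disk_used, 10))
```

Real metrics are noisier than this, and irregular intervals would require using actual timestamps as the x values.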
Log aggregation tools deal with data that relates more to events than to a series of consistent data points. This output is often emitted as a system enters some undesired state, although some systems also output a lot of logs that don’t fit this condition. We’ll cover more of the do’s and don’ts of logging in a future article.
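As a small illustration of an event-style output, the sketch below logs only when a hypothetical work queue crosses a limit. The logger name, the event fields, and the limit are assumptions chosen for the example; emitting the event as JSON is one common way to make it easier for an aggregation system to parse.

```python
import json
import logging

logger = logging.getLogger("queue_worker")  # hypothetical component name


def report_queue_depth(depth, limit=1000):
    """Log an event only when the queue enters an undesired state."""
    if depth <= limit:
        return None  # healthy state: nothing is logged
    event = json.dumps({"event": "queue_backlog", "depth": depth, "limit": limit})
    logger.warning(event)
    return event


report_queue_depth(10)     # no log line
report_queue_depth(5000)   # one warning with a structured payload
```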
Alerting and visualization may not appear to fit with the other types listed, as it’s really subsequent to the others, but it provides a consumable output for the other types and can produce its own outputs. These types of tools generally make the system more understandable to humans. They also help create a more interactive system through both proactive and reactive notifications about negative system states.
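A notification rule can be as simple as a threshold on a metric. The sketch below is one possible rule, assuming a hypothetical memory-usage series in percent: it fires only when the last few samples all exceed the threshold, so that a single spike does not page anyone. The function name and parameter values are illustrative.

```python
def should_alert(samples, threshold, sustained=3):
    """Return True when the last `sustained` samples all exceed `threshold`."""
    if sustained < 1:
        raise ValueError("sustained must be at least 1")
    if len(samples) < sustained:
        return False  # not enough data to decide yet
    return all(s > threshold for s in samples[-sustained:])


# Hypothetical memory usage in percent.
print(should_alert([61, 63, 92, 95, 97], threshold=90))  # sustained breach
print(should_alert([61, 95, 62, 96, 63], threshold=90))  # isolated spikes
```

Requiring a sustained breach trades a little detection delay for fewer noisy alerts; the right balance depends on the system.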
Much like tracing within a single application, distributed tracing allows you to follow a single transaction through an entire system. This allows you to home in on specific transactions that might be experiencing problems. Due to performance concerns, a sampling algorithm is often applied.
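One way to sample, sketched here as an assumption rather than as how any particular tracing tool works, is to derive the keep-or-drop decision from the trace ID itself. Every service that sees the same ID then makes the same decision, so a sampled transaction is recorded end to end. The function name and the rate are hypothetical.

```python
import hashlib


def keep_trace(trace_id, sample_rate):
    """Decide whether to record a trace, consistently for a given trace ID."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


# Keep roughly 10% of traces; the same ID always gets the same answer.
kept = sum(keep_trace(f"trace-{i}", 0.10) for i in range(10_000))
print(kept)
```

The trade-off is that rare, problematic transactions may fall outside the sample, which is why some systems decide after the fact which traces to keep.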
Common DevOps features
There are several aspects you should look for in any type of observability tool. We’ll cover these generally now and will bring them back up in future articles.
The OpenAPI Specification was previously called Swagger but was renamed when it was adopted by the OpenAPI Initiative within the Linux Foundation. It is a language-agnostic format for describing HTTP APIs, and tooling built around it can automatically generate documentation of methods, parameters, and models. It is most commonly used to describe RESTful interfaces. Given such a description, a user can create a client in almost any language if one doesn’t already exist. Every tool should have this type of API (or should be getting it soon). If your tool doesn’t have it yet, you may want to look elsewhere. Tools that haven’t implemented this specification or don’t have it on their roadmap likely have other deficiencies in adopting open, modern standards and code.
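To give a feel for what such a description contains, here is a minimal sketch of an OpenAPI 3 document held as a Python dictionary, plus a few lines that turn it into an operation listing. The alerting API it describes (paths, summaries, title) is entirely hypothetical, and real documentation generators do far more than this.

```python
HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}

# Hypothetical API description; only the overall layout follows OpenAPI 3.
spec = {
    "openapi": "3.0.3",
    "info": {"title": "Example Alerting API", "version": "1.0.0"},
    "paths": {
        "/alerts": {
            "get": {"summary": "List alerts",
                    "responses": {"200": {"description": "OK"}}},
            "post": {"summary": "Create an alert",
                     "responses": {"201": {"description": "Created"}}},
        },
        "/alerts/{id}": {
            "get": {"summary": "Fetch one alert",
                    "responses": {"200": {"description": "OK"}}},
        },
    },
}


def list_operations(document):
    """Return one 'METHOD path: summary' line per operation in the document."""
    lines = []
    for path, item in sorted(document.get("paths", {}).items()):
        for method, operation in sorted(item.items()):
            if method in HTTP_METHODS:  # skip non-operation keys such as "parameters"
                lines.append(f"{method.upper()} {path}: {operation.get('summary', '')}")
    return lines


for line in list_operations(spec):
    print(line)
```

Because the description is plain data, the same document can drive documentation, client generation, and validation.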
There are a lot of good tools in this space that aren’t open source but may be the right fit for your company. If you pick one of those tools, make sure its documentation and accessory tooling are open source. Open source observability tools can provide valuable insights into how your other observability tools are functioning (or maybe not functioning). They also offer all the other benefits of any open source project, which you can read more about on Opensource.com.
Regardless of whether or not a tool is open source, it should always use open standards when possible. We’ve already discussed one of these, OpenAPI, but there are many more. We’ll discuss these standards in the appropriate sections to ensure you know they exist and where they’re used.
Part of observability and openness is allowing everyone to view data. The tools you pick should be open by default. You may want to restrict some areas, but you’ll want to default to open and close access only if it’s absolutely required. You never know who in your company might want to solve your problem or who you’ll need to bring in to help solve a problem. The last thing you’ll want is access barriers when troubleshooting your income source.
Federated model (preferred)
This is similar to defaulting to open, but it allows everyone to provide input and control their own areas more locally. Many legacy systems are architected in a way that requires all data to flow through a central system regardless of need. This also centralizes control around that data. A federated system allows for local aggregation, processing, and control while allowing a central organization to collect the same data or summarized data. The central system likely only wants a subset of the data stored at the local level. This model increases agility, flexibility, and usability.
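The local-then-central flow can be sketched in a few lines. In this hypothetical example, each site keeps its raw latency samples and sends only a small summary upstream, and the central system merges the summaries. The site names, sample values, and summary fields are assumptions chosen for illustration.

```python
def summarize(samples):
    """Local aggregation: reduce raw samples to a small summary."""
    if not samples:
        raise ValueError("cannot summarize an empty sample list")
    return {"count": len(samples), "total": sum(samples), "max": max(samples)}


def merge(summaries):
    """Central aggregation: combine summaries without seeing the raw data."""
    return {
        "count": sum(s["count"] for s in summaries),
        "total": sum(s["total"] for s in summaries),
        "max": max(s["max"] for s in summaries),
    }


# Hypothetical request latencies in milliseconds, stored at each site.
site_a = [120, 130, 500]
site_b = [90, 110]

central = merge([summarize(site_a), summarize(site_b)])
print(central, central["total"] / central["count"])
```

Count, total, and maximum merge cleanly across sites. Statistics such as percentiles do not, which is one reason to decide early which summaries the central system actually needs.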
In this series, we’ll be exploring each of the observability tool types in more detail. We’ll also help you choose the right tool for your use case. Feel free to read them in any order you want—or you can download the entire guide.