LinuxFest Northwest interview with Ilan Rabinovitch
Finding the signal in the noise of Linux system monitoring
Ilan Rabinovitch is well-known to anyone who has done the conference circuit around Southern California. He's a helpful and friendly guy who I met once at a BarCamp years ago and more than once encountered on IRC. I often ended up getting great tech tips from him by sheer proximity.
Ilan is speaking at this year's LinuxFest Northwest, training attendees on system monitoring. I spoke with him to see what he'd be covering.
Anyone starting out in system administration hears a lot about monitoring. Broadly speaking, what is monitoring and how's it done?
As systems administrators and application developers, we build and deploy services that our colleagues or users depend on. Monitoring is the practice of observing and recording the behavior of these services to ensure they are functioning as we expect. Monitoring allows us to receive meaningful, automated alerts about potential problems and ongoing failures. It also allows us to quickly investigate and get to the bottom of issues or incidents once they occur.
What should I be monitoring? Is it just my network, or are there other things that I should be looking at? Will your talk address those things?
In short, monitor everything. Collecting data is cheap, but not having it when you need it can be expensive, so you should instrument everything and collect all the useful data you reasonably can. So yes, you want data about your network, but also about your operating systems, your applications, and even business metrics.
Individually, they might be interesting. Combined, they can help you tell the full story of why an incident occurs in your environment. You never want to be in a situation where you cannot explain why you experienced an outage or service degradation.
That being said, the specifics of what to monitor and alert on will differ from environment to environment. What's important to you and your application may not be critical for me. During the session at LinuxFest Northwest, we'll review some frameworks and methods for categorizing your metrics and identifying what is important to you and your business.
The more you monitor, the more you find you have to sift through. What are some of the tactics you propose a sys admin take to mitigate that?
It's important to distinguish between monitoring and alerting. Just because you collect a piece of data for trending or investigating purposes doesn't mean you need to page a team member every time it moves.
If we're paging someone, it should be urgent and actionable every single time. As soon as you start questioning the validity of alerts and whether they require a response, you are going to start picking and choosing when to respond and investigate alerts from your monitoring. This is called pager fatigue and significantly undermines your monitoring goals.
You primarily want to focus your alerting around work metrics: metrics that quantify the work or useful output from your environment. These represent the top-level health of your systems and services and are the true indicator of whether your systems are performing correctly. Your users will never call you to say "CPU is high," but they might complain about slowing responses on your APIs or the service being down entirely. So why are you waking up engineers about high CPU usage?
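As a rough illustration of the distinction drawn above, the following is a minimal Python sketch that records every sample but pages only on work metrics. Nothing in it comes from the interview: the `Sample` fields, the threshold values, and the function names are all assumptions chosen for the example, and a real setup would tune them to its own environment.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical thresholds, chosen only for illustration.
MAX_P95_LATENCY_MS = 500.0
MAX_ERROR_RATE = 0.01


@dataclass
class Sample:
    p95_latency_ms: float  # work metric: how slow responses are for users
    error_rate: float      # work metric: fraction of requests that failed
    cpu_percent: float     # resource metric: recorded, but never paged on


def record(sample: Sample, history: List[Sample]) -> None:
    """Keep every sample, resource metrics included, for later investigation."""
    history.append(sample)


def should_page(sample: Sample) -> bool:
    """Page only when a user-facing work metric is out of bounds."""
    return (
        sample.p95_latency_ms > MAX_P95_LATENCY_MS
        or sample.error_rate > MAX_ERROR_RATE
    )
```

In this sketch a host at 97% CPU with healthy latency and error rate is recorded but wakes no one, while a slow API pages someone even if CPU looks idle.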
What are some of your personal favorite monitoring tools?
I'd have to say Datadog is my favorite tool at the moment! And it's not because they're my employer. It's actually the other way around. The reason I joined Datadog this past summer was due to how much I loved using their products as a customer. We're a mix of a hosted service and an open source agent that runs on your servers to collect metrics. We tend to focus on environments with dynamic infrastructure (containers, cloud, and auto-scaling or scheduled workloads), as well as aggregating metrics from a number of sources including other monitoring systems.
The open source world has seen some great developments and improvements in our toolsets in recent years. While Nagios and Cacti or Ganglia have been the workhorses of the open source monitoring stack for the better part of the last 20 years, we now have a number of new tools such as Graphite and Grafana for time series data, ELK for log management, and much more.
Which tool you pick to form your monitoring stack will depend on your environment, staff, and scaling needs. The monitoringsucks project offers a great overview of available tools. Brendan Gregg also has some amazing resources for understanding Linux system performance on his blog. I especially like the Linux Observability slide.
Setup and needing to learn new tools often deter people from monitoring. How long did it take you to get comfortable with monitoring and the tools that go along with it?
I've been working with some form of monitoring for over 15 years, but I wouldn't say I've mastered it. Monitoring, like other focuses of engineering, is an ongoing learning experience. Each time I build a new environment or system, it has different requirements either because of new technologies or new business needs.
With that in mind, with each project I get to reevaluate which metrics are most important and how best to collect them.
Not knowing how to diagnose an issue is another monitoring deterrent. What do you do when an issue raises a red flag, but you have no idea how to solve it?
Burying one's head in the sand is never a solution. Even if you don't know how to respond, it's much better to be aware that your service or systems have failed than to find out when irate users or customers call.
With that in mind, start simple. Find the metrics and checks that tell you if your service is online and performing the way your users expect. Once you've got that in place, you can expand your coverage to other types of metrics that might deepen visibility into how your services interact or of their underlying systems.
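One way to picture "start simple" is a single check that answers two questions: is the service online, and is it responding as quickly as users expect? The sketch below is an assumption-laden example rather than anything described in the interview. The function names, the one-second latency budget, and the choice of a plain HTTP GET are all hypothetical, and it uses only Python's standard library.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, NamedTuple, Optional


class CheckResult(NamedTuple):
    online: bool        # did the service answer with a 2xx status?
    fast_enough: bool   # did it answer within the latency budget?
    elapsed_s: float    # how long the check took, in seconds


def http_status(url: str, timeout_s: float = 5.0) -> int:
    """Return the HTTP status code of a GET request to `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry a status code worth reporting.
        return err.code


def check_service(
    url: str,
    max_latency_s: float = 1.0,  # hypothetical latency budget
    fetch: Callable[[str], int] = http_status,
) -> CheckResult:
    """Run one check and report availability and responsiveness."""
    start = time.monotonic()
    status: Optional[int]
    try:
        status = fetch(url)
    except OSError:
        # Connection failures and timeouts are OSError subclasses.
        status = None
    elapsed = time.monotonic() - start
    online = status is not None and 200 <= status < 300
    return CheckResult(online, online and elapsed <= max_latency_s, elapsed)
```

The `fetch` parameter exists only so the check can be exercised without a network; calling `check_service("https://example.com/")` would use the real HTTP request.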
How do you keep from falling into a false sense of security once your monitoring system is up and running?
Post-mortems and incident reviews are great learning opportunities. Use them to learn from any mistakes, as well as to identify areas in your environment where you would benefit from more visibility. Did your monitoring detect the issue, or did a user? Is there a leading indicator that might have brought your attention to the issue before it impacted your business?
What's your background? How did you get started in the tech business, and what do you do?
I got my start tinkering with a hand-me-down computer my parents gave my siblings when they upgraded the computer they used for their small business. I did quite a bit of tinkering over the years, but my interests got a jump start when I started playing with Linux with the goal of building a home router. I quickly found myself attending user group meetings with groups like SCLUG, UCLA LUG, and others. The skills I picked up at LUGs and in personal projects helped me pick up consulting and entry-level sys admin work.
Since then, I've spent many years as a sys admin. I've led infrastructure automation and tooling teams for large web operations like Edmunds.com and Ooyala. Most recently, I've had the opportunity to combine my interest in large-scale systems with my open source community activities as director of technical community at Datadog.
You clearly do a lot with open source, and for the open source community. Why's open source important to you?
The open source community and Linux helped me get my start in technology. The open nature of Linux and other software made it possible for me to dive in and learn about how everything fit together. Then LUGs and other community activities were always there to help if I ever got stuck. It's important to me to stay active in the community and give back so that others can have similar experiences.
Wait a minute, aren't you that one guy from SCaLE? How did you get involved with SCaLE and how's it going?
Indeed! I'm one of the co-founders and the current conference chair of SCaLE. SCaLE started back in 2002 as a one-day event focused on open source and free software at the University of Southern California. We started SCaLE at a time when there were very few tech events in the Southern California area. Additionally, the available FOSS/Linux-focused events were primarily organized as commercial ventures (e.g., LinuxWorld). Between geography, cost, and in some cases, age requirements, we found it was quite difficult to see developer-led sessions about the open source projects we were interested in. In the spirit of open source, we decided to scratch our own itch and build a place where we could attend the sessions we were interested in seeing. Naively, we thought, "How hard could it be to start a conference?"
Thus, SCaLE was born.
We are now launching planning for our 15th year of SCaLE, which will be held March 2-5, 2017 at the Pasadena Convention Center. While the event has grown from just a few hundred attendees in our first year to 3,200, we've made a strong effort to keep our community feel. We like to think we've succeeded in meeting our original goals of bringing community, academia, and business together under a single affordable and accessible conference.
In that case, I have to ask: What distribution of Linux do you use, and what's your current desktop or window manager?
My home machine runs Ubuntu, so Unity on the desktop. My personal servers tend to be Debian-based.
At Datadog, we primarily use Ubuntu across our infrastructure, although we work with a wide set of Linux distros to ensure our open source agent works well across the board. Our Docker containers start off from a Debian image.