Prometheus is an open source monitoring and alerting toolkit for containers and microservices. Organizations of every size and in many industries have adopted it. The toolkit is highly customizable and designed to deliver rich metrics without dragging down system performance. That adoption has made Prometheus the mainstream open source monitoring tool for teams that lean heavily on containers and microservices.
Conceived at SoundCloud in 2012, Prometheus joined the Cloud Native Computing Foundation (CNCF) in 2016. In August 2018, CNCF announced that Prometheus was the second "graduated" project in the organization's history.
Prometheus provides a key component for a modern DevOps workflow: keeping watch over cloud-native applications and infrastructure, including another popular CNCF project, Kubernetes.
Here's how some DevOps organizations are turning open source monitoring with Prometheus into an operational advantage.
Banking on Prometheus
Financial services behemoth Northern Trust turned to Prometheus in June 2017, not for application monitoring, but to get a better view of some of its hardware, according to Alan Strader, an architect and operator at the company. "We also get capacity and performance reporting to tell us when we're running into issues and use it for forecasting and increases in hardware," he explains in a presentation.
While Northern Trust likes the flexibility and granularity of Prometheus, Strader admits to its "fairly steep learning curve" and high upfront costs to educate the team about the toolkit. "But we figured it was significantly cheaper than commercial solutions because there were no recurring hard dollar costs to pay on a monthly or annual service," Strader said. Northern Trust uses Prometheus to keep tabs on more than 750 microservices on its platforms, he says.
Fighting alert fatigue
When your content delivery network (CDN) consists of 116 data centers scattered around the globe, you want to keep an eye on things—especially when you average 5 million HTTP requests per second. Cloudflare provides DNS and DDoS mitigation services for more than 6 million websites. It needed monitoring help, especially with the "alert fatigue" that had started to set in, says Matt Bostock, who works with the platform operations team at Cloudflare.
Cloudflare uses 188 Prometheus servers worldwide, plus four top-level Prometheus servers, for alerting on critical production issues, incident response, post-mortem analysis, and metrics.
Bostock says the deployment taps Prometheus Alertmanager, which de-duplicates Prometheus alerts. "Alertmanager groups the incoming alerts by POP and alert name, which helps us reduce the amount of alert noise we receive," he explains. Cloudflare also sets alerts for symptoms rather than causes, which Bostock says reduces overall alert volume and allows the organization to be more proactive. "If you [set] alerts on machines or causes, you're going to have a lot of alerts," he warns.
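To make the grouping idea concrete, here is a minimal Python sketch of de-duplicating alerts and bucketing them by POP and alert name. It is an illustration of the concept only, not Alertmanager's implementation or Cloudflare's configuration; in Alertmanager itself the grouping labels are set with the `group_by` option of a route. The alert records, the label name `pop`, and the function `group_alerts` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical alert records. The labels "pop" and "alertname" mirror the
# grouping described in the text; the values are made up for illustration.
alerts = [
    {"alertname": "HighErrorRate", "pop": "ams", "instance": "edge1"},
    {"alertname": "HighErrorRate", "pop": "ams", "instance": "edge2"},
    {"alertname": "HighErrorRate", "pop": "ams", "instance": "edge1"},  # exact duplicate
    {"alertname": "HighErrorRate", "pop": "sjc", "instance": "edge7"},
    {"alertname": "DiskFull", "pop": "ams", "instance": "edge2"},
]


def group_alerts(alerts, group_by=("pop", "alertname")):
    """Drop exact duplicates, then bucket the remaining alerts by label values."""
    seen = set()
    groups = defaultdict(list)
    for alert in alerts:
        # Two alerts with identical label sets count as the same alert.
        fingerprint = tuple(sorted(alert.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


groups = group_alerts(alerts)
for (pop, name), members in sorted(groups.items()):
    print(f"{pop}/{name}: {len(members)} alert(s) in one notification")
```

With this sample input, five incoming alerts collapse into three notifications, which is the kind of noise reduction Bostock describes.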
Simplifying with one service to rule them all
Blessed with some downtime after the studio's first feature-length film, the developers at Montreal-based L'Atelier Animation started looking for alternatives to the existing monitoring system. What the studio had, a mix of Nagios, Graphite, and InfluxDB, was "a setup with too many moving parts," according to Barthelemy Stevens, head of IT for the studio. The team evaluated new monitoring options for its infrastructure, which includes approximately 300 render blades, 150 workstations, and 20 servers, with almost everything running CentOS Linux.
L'Atelier Animation chose Prometheus after pinpointing four key characteristics: its Node Exporter can be customized to fetch any data from clients; SNMP support obviates the need for a third-party service; its alerting system is superior to Nagios; and it boasts Grafana support, Stevens says.
The upgrade gave the animation studio an opportunity to change the way it monitors everything and inspired the creation of a new custom floor map derived from Prometheus data. "The setup is a lot simpler with one service to rule them all," Stevens says. L'Atelier Animation is also integrating software licenses with Prometheus. "The information will give artists a good idea of who is using what and where," Stevens adds.
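One common way to feed custom data such as license usage through Node Exporter is its textfile collector, which reads `*.prom` files from the directory given by `--collector.textfile.directory`. The Python sketch below shows what writing such a file could look like. It is an assumption about how this kind of integration might be built, not a description of L'Atelier Animation's setup; the metric name `license_seats_in_use`, the product labels, and the helper functions are hypothetical.

```python
import os
import tempfile


def render_gauge(name, help_text, samples):
    """Render one gauge in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs. This sketch assumes
    label values contain no quotes, backslashes, or newlines, so it does
    no escaping.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


def write_textfile(directory, filename, text):
    """Write to a temp file, then rename, so a reader never sees a partial file."""
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file readable by owner only
    os.replace(tmp_path, os.path.join(directory, filename))


# Hypothetical license counts; a real script would query the license server.
samples = [({"product": "renderer"}, 42), ({"product": "compositor"}, 7)]
text = render_gauge(
    "license_seats_in_use",
    "Seats currently checked out (hypothetical metric).",
    samples,
)
out_dir = tempfile.mkdtemp()  # stand-in for the textfile collector directory
write_textfile(out_dir, "licenses.prom", text)
print(text)
```

A script like this would typically run on a schedule, and a dashboard could then chart the gauge per product.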
Driving better insights
Life360, a mobile app for location, driving safety, and information sharing among family members, runs approximately 20 services in production. Most of them handle location requests from mobile clients, and the deployment can spike to more than 150 instances.
"We primarily use MySQL, NSQ, and HAProxy, and we found that all the monitoring solutions [used previously] were very partial and required a lot of customization to actually get all working together," says Daniel Ben Yosef, a Life360 infrastructure engineer.
The company needed a better way to monitor its MySQL multi-master cluster and a 12-node Cassandra ring, which holds about 4TB of data. Prometheus performed well in initial testing. "The [proof of concept] results were incredible," Ben Yosef says. "The monitoring coverage of MySQL was amazing, and we also loved the JMX monitoring for Cassandra, which had been sorely lacking."
After a limited deployment of Prometheus, Life360 reports a big gain in visibility and instrumentation and envisions using it in other parts of its data center infrastructure. "As we build out new services, Prometheus is becoming our go-to for instrumentation and will help us gain extremely meaningful alerts and stats about our infrastructure," Ben Yosef adds.
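Instrumenting a service for Prometheus generally means keeping counters or gauges in the application and exposing them over HTTP, conventionally at `/metrics`, in the text exposition format that a Prometheus server scrapes. The sketch below shows the idea using only the Python standard library. It is not Life360's code; the metric name `app_requests_total` and the `Counter`, `render_metrics`, and `make_server` names are hypothetical, and a real service would more likely use an official client library such as `prometheus_client`.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class Counter:
    """A tiny thread-safe counter, standing in for a client-library metric."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def inc(self, amount=1):
        with self._lock:
            self._value += amount

    def value(self):
        with self._lock:
            return self._value


REQUESTS = Counter()  # hypothetical metric: requests handled by the service


def render_metrics():
    """Return the current metrics in the Prometheus text exposition format."""
    return (
        "# HELP app_requests_total Requests handled (hypothetical metric).\n"
        "# TYPE app_requests_total counter\n"
        f"app_requests_total {REQUESTS.value()}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port=8000):
    """Build (but do not start) an HTTP server that serves /metrics."""
    return HTTPServer(("", port), MetricsHandler)


# Simulate two handled requests, then show what a scrape would return.
REQUESTS.inc()
REQUESTS.inc()
print(render_metrics())
```

Calling `make_server(8000).serve_forever()` would start serving the endpoint; it is left out of the sketch because it blocks until interrupted.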
Giving containers a checkup
As a data company focused on improving the lives of cancer patients, Cota Healthcare enriches medical records to create research-grade data and joins it with a suite of analysis, visualization, and management tools. With millions of patient records entrusted to it, visibility and security are paramount to its business. As Cota moved to Kubernetes within the cloud, the company realized it needed to monitor and secure its container environment.
"We also knew we would need visibility into everything," says Ashley Penney, VP of infrastructure at Cota. "You can't operate a system where you don't have any idea what's happening. That doesn't work." To get more in-depth information about its application performance and behavior, Cota chose to pair Prometheus metrics with Sysdig, a performance and security monitoring solution that uses Prometheus' custom metrics for monitoring, alerting, and troubleshooting.
"From the infrastructure team's perspective, it's nice that we can tell our developers, 'emit metrics with Prometheus and we'll pick them up with our monitoring tool,'" Penney says. "We use Prometheus to generate metrics for Stackdriver and even Google Cloud. And there are a ton of other Prometheus exporters we can use."