Everyone loves building shiny, new systems using the latest technologies and especially the most modern DevOps tools. But that's not the reality for lots of operations teams, especially those running larger systems with millions of users and old, complex infrastructure.
It's even worse for teams taking over existing systems as part of company mergers, department consolidation, or changing managed service providers (MSPs). The new team has to come in and hit the ground running while keeping the lights on using a messy system they know nothing about.
We've spent a decade doing this as a large-scale MSP in China, taking over and managing systems with 10 million to 100 million users, usually with little information. This can be a daunting challenge, but our four-phase approach and related tools make it possible. If you find yourself in a similar position, you might benefit from our experience.
Phase 1: Stop the bleeding
As any good combat medic knows, our first priority is to stop the bleeding while working hard to save the patient. This means talking to existing teams—and especially end users—about the system's most urgent problems. These are often instability, slow performance, and security issues, in that order.
Often there are also serious, hidden issues such as failed backups, dead RAID disks, and open security ports, all of which we hunt down early on. Thus, in addition to the users' problems survey, we do a quick scan of the system looking for obvious issues. From these investigations, we build lists of all the problems: the ones we can tackle right away and the ones we'll need to come back to later.
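As one example of what that quick scan can look like, a first pass for database and cache ports exposed to the internet can be as simple as the Python sketch below; the port list and the example address are purely illustrative, not a real audit:

```python
#!/usr/bin/env python3
"""Minimal sketch: a quick check for ports that probably should not be open
to the world. The port list and the target address are examples only; the
real list depends on the system being taken over."""
import socket

# Ports we rarely want reachable from outside (illustrative, not exhaustive).
RISKY_PORTS = {
    3306: "MySQL",
    6379: "Redis",
    9200: "Elasticsearch",
    11211: "Memcached",
    27017: "MongoDB",
}

def check_host(host: str, timeout: float = 2.0) -> None:
    """Print any risky ports that accept a TCP connection on the given host."""
    for port, name in sorted(RISKY_PORTS.items()):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            if sock.connect_ex((host, port)) == 0:
                print(f"{host}:{port} ({name}) is reachable -- should this be public?")

if __name__ == "__main__":
    # Run from outside the network against each public address (example address below).
    check_host("192.0.2.10")
```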
We also make sure all the backups, including offsite, are working and make our own backups in case we break something while fixing things. This happens all too often.
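To give a concrete, if simplified, picture of that backup check, here is a minimal Python sketch; the backup directory location, the one-file-per-job layout, and the freshness threshold are all assumptions for illustration:

```python
#!/usr/bin/env python3
"""Minimal sketch: flag stale, empty, or missing backups before touching anything.
Assumes backups land as files under BACKUP_DIR and that anything older than
MAX_AGE_HOURS is suspect; both are illustrative assumptions."""
import time
from pathlib import Path

BACKUP_DIR = Path("/backups")   # hypothetical location
MAX_AGE_HOURS = 26              # daily jobs plus some slack

def check_backups(backup_dir: Path, max_age_hours: float) -> list[str]:
    """Return a list of human-readable problems found under backup_dir."""
    problems = []
    if not backup_dir.is_dir():
        return [f"backup directory {backup_dir} does not exist"]
    newest = 0.0
    for path in backup_dir.rglob("*"):
        if path.is_file():
            newest = max(newest, path.stat().st_mtime)
            if path.stat().st_size == 0:
                problems.append(f"zero-byte backup file: {path}")
    if newest == 0.0:
        problems.append(f"no backup files found in {backup_dir}")
    elif (time.time() - newest) > max_age_hours * 3600:
        age_h = (time.time() - newest) / 3600
        problems.append(f"newest backup is {age_h:.1f} hours old")
    return problems

if __name__ == "__main__":
    for issue in check_backups(BACKUP_DIR, MAX_AGE_HOURS) or ["backups look current"]:
        print(issue)
```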
Then we fix as many urgent things as we can on the way to stemming the blood loss: changing configurations (using our own where we can), closing public ports, fixing Java heap allocations, adjusting Apache worker counts, and so on. We also set up basic logging and monitoring so we can start to see what we currently can't.
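To illustrate the kind of configuration check involved (this is not our audit tool), a first-pass look at Java heap settings on a host might look like the Python sketch below; the "over half of RAM" rule is just an example threshold:

```python
#!/usr/bin/env python3
"""Minimal sketch: flag Java processes with missing or oversized -Xmx heaps.
The 'over half of RAM' threshold is illustrative, not a production rule."""
import re
import subprocess

def system_ram_mb() -> int:
    """Read total RAM in MB from /proc/meminfo (Linux-specific)."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemTotal:"):
                return int(line.split()[1]) // 1024
    raise RuntimeError("MemTotal not found in /proc/meminfo")

def xmx_mb(cmdline: str):
    """Return the -Xmx value in MB, or None if no explicit heap is set."""
    m = re.search(r"-Xmx(\d+)([kKmMgG]?)", cmdline)
    if not m:
        return None
    value, unit = int(m.group(1)), m.group(2).lower()
    factors = {"": 1 / (1024 * 1024), "k": 1 / 1024, "m": 1, "g": 1024}
    return int(value * factors[unit])

if __name__ == "__main__":
    ram = system_ram_mb()
    ps = subprocess.run(["ps", "-eo", "pid,args"], capture_output=True, text=True, check=True)
    for line in ps.stdout.splitlines()[1:]:
        parts = line.split(None, 1)
        args = parts[1] if len(parts) > 1 else ""
        command = args.split(None, 1)[0] if args else ""
        if command == "java" or command.endswith("/java"):
            heap = xmx_mb(args)
            if heap is None:
                print(f"no explicit -Xmx (default heap): {args[:80]}")
            elif heap > ram // 2:
                print(f"-Xmx of {heap} MB is over half of {ram} MB RAM: {args[:80]}")
```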
Our first toolset in this phase includes our operating system, service, and cloud audit/governance tool along with our deep configuration management database (CMDB) systems, which give us a detailed view of key issues, anti-patterns, overloads, bad configurations, open ports, misconfigured heaps and workers, bad SSL, and so on.
We also use deep monitoring to look at what's really going on. This includes monitoring the site reliability engineering (SRE) Golden Signals to see rates, errors, latency, and saturation at every level of the system, from the disks up through databases, app servers, web servers, and each subservice in the application.
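As a rough illustration of the idea (real monitoring uses proper agents and dashboards), the Python sketch below pulls approximate rate, error, and latency numbers out of a web access log; the log format and the five-minute window are assumptions, and saturation still has to come from host metrics:

```python
#!/usr/bin/env python3
"""Minimal sketch: rough Golden Signals (rate, errors, latency) from an access log.
Assumes a combined-style log where field 9 is the HTTP status and the last field
is the request time in seconds -- both are assumptions about the log format."""
import sys

def summarize(path: str, window_seconds: float = 300.0) -> None:
    statuses, latencies = [], []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) < 10:
                continue
            try:
                statuses.append(int(fields[8]))
                latencies.append(float(fields[-1]))
            except ValueError:
                continue  # skip lines that do not match the assumed format
    if not statuses:
        print("no parsable requests found")
        return
    errors = sum(1 for s in statuses if s >= 500)
    lat_sorted = sorted(latencies)
    p50 = lat_sorted[len(lat_sorted) // 2]
    p95 = lat_sorted[int(0.95 * (len(lat_sorted) - 1))]
    print(f"rate:    {len(statuses) / window_seconds:.1f} req/s (assuming a {window_seconds:.0f}s log window)")
    print(f"errors:  {errors} 5xx ({100.0 * errors / len(statuses):.2f}%)")
    print(f"latency: p50={p50 * 1000:.0f}ms p95={p95 * 1000:.0f}ms")

if __name__ == "__main__":
    summarize(sys.argv[1] if len(sys.argv) > 1 else "access.log")
```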
This anti-bleeding phase usually takes a week to a month.
Phase 2: Find all the bodies
Once the patient has been saved and is more or less stabilized, it's time to find out what we have, especially where all the medium- and long-term problems are. The goal in this phase is to discover and document the system, fix more things along the way, and start building a real plan to overhaul as much as is safely possible over the next few weeks.
One of the key problems in this phase is figuring out how all the parts are related. This can be a real challenge even without microservices, especially in older and larger systems: many services running on a single host, several databases of various types scattered around the system, plus caches, load balancers, proxies, NFS mounts, and more all over the place, often doubled up with other things.
This all makes for a very brittle system, and sadly, we've broken many systems while trying to figure them out or by making tiny adjustments that took out seemingly unrelated services.
Our toolset here includes our CMDB, service and link discovery, auto-diagramming, and log analytics systems, all of which let us look deeply into what is going on. We'll also use application performance management (APM) tools, when we can, to dig further into where the code bottlenecks are, especially when we find issues in the databases.
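To show the link-discovery idea in miniature (our own tooling goes much deeper), the Python sketch below maps which local processes talk to which remote peers by parsing established TCP connections; it assumes a Linux host with the `ss` utility available:

```python
#!/usr/bin/env python3
"""Minimal sketch: discover which processes talk to which remote hosts by
parsing established TCP connections from `ss`. Assumes a Linux host with
the ss utility installed; real discovery needs far more than this."""
import re
import subprocess
from collections import defaultdict

def discover_links() -> dict:
    """Map local process names to the set of remote host:port peers."""
    out = subprocess.run(["ss", "-tnp"], capture_output=True, text=True, check=True).stdout
    links = defaultdict(set)
    for line in out.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 5 or fields[0] != "ESTAB":
            continue
        peer = fields[4]                   # remote address:port
        m = re.search(r'"([^"]+)"', line)  # process name from users:(("nginx",pid=...))
        proc = m.group(1) if m else "unknown"
        links[proc].add(peer)
    return links

if __name__ == "__main__":
    for proc, peers in sorted(discover_links().items()):
        print(f"{proc}: {', '.join(sorted(peers))}")
```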
The body-discovery phase usually takes a couple months.
Phase 3: Rebuild the racecar during the race
Finally, we have to rebuild the system. This usually means replacing every component with a newer version of itself, on the latest OS and software versions, with best-practice configurations, all secured, monitored, and backed up correctly. This must be done while the system is running, of course, with little or no downtime, and ideally mostly during the day: we are never excited about updating dozens of systems at 3am, not to mention the mistakes that happen when working half-asleep.
We build a master plan with careful sequencing so we can change things piece by piece. Ideally, we add high availability early on so we can take parts of the system offline as we go. Every step requires very careful coordination with a large number of stakeholders, including developers, operations, support, help desks, and even marketing (to avoid promotional periods).
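To illustrate just the sequencing part (the real plan is far messier), a dependency-ordered rebuild list can be derived with a simple topological sort, as in this Python sketch; the components and dependencies shown are made up for the example:

```python
#!/usr/bin/env python3
"""Minimal sketch: derive a safe rebuild order from component dependencies.
The components and dependencies below are made-up examples, not a real plan."""
from graphlib import TopologicalSorter

# Each component maps to the set of things that must be rebuilt before it.
dependencies = {
    "monitoring":    set(),                            # first, so we can see what we break
    "backups":       set(),
    "load_balancer": {"monitoring"},
    "database":      {"backups", "monitoring"},
    "app_servers":   {"database", "load_balancer"},
    "web_servers":   {"app_servers", "load_balancer"},
}

if __name__ == "__main__":
    for step, component in enumerate(TopologicalSorter(dependencies).static_order(), 1):
        print(f"step {step}: rebuild {component}")
```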
Our rebuild toolset includes lots of careful manual work plus as many automated tools as we can apply, including cloud automation (CloudFormation, Terraform, etc.), configuration tools (mostly Ansible), and more. All are tied to our server design and auto-build systems using our best-practice configurations for various services.
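As one example of how these pieces can be glued together (a hypothetical script, not our actual pipeline), a small Python sketch can turn a CMDB export into an Ansible inventory so that configuration runs always target what the CMDB says exists; the JSON format used here is an assumption for illustration:

```python
#!/usr/bin/env python3
"""Minimal sketch: generate an INI-style Ansible inventory from a CMDB export.
The cmdb.json format (a list of {"hostname": ..., "role": ...} records) is an
assumed example schema, not a real CMDB layout."""
import json
from collections import defaultdict

def build_inventory(cmdb_path: str) -> str:
    """Group hosts by role and render them as Ansible inventory sections."""
    with open(cmdb_path) as f:
        hosts = json.load(f)
    groups = defaultdict(list)
    for host in hosts:
        groups[host["role"]].append(host["hostname"])
    lines = []
    for role, names in sorted(groups.items()):
        lines.append(f"[{role}]")
        lines.extend(sorted(names))
        lines.append("")
    return "\n".join(lines)

if __name__ == "__main__":
    print(build_inventory("cmdb.json"))
```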
This usually takes a few months to a year, as it often depends on busy third parties such as the app development, networking, and security teams, and sometimes on financial approvals.
Phase 4: Manage for the long-term
After we've saved the patient, fixed all its problems, and rebuilt it for the future, we must keep the system up and running, managed 24x7. This is a whole new phase in which the hard work of rebuilding pays off: ideally, it's all smooth sailing from here on. In reality, large and dynamic systems will still have plenty of issues over time, but our work to update the architecture, versions, configurations, monitoring, and more should keep paying off in future years.
Steve Mushero will present Taking Over & Managing Large Messy Systems at LISA18, October 29-31 in Nashville, Tennessee, USA.