During winter 2012, Netflix suffered an extended outage that lasted for seven hours due to problems in the AWS Elastic Load Balancer service in the US-East region. (Netflix runs on Amazon Web Services [AWS]—we don't have any data centers of our own. All of your interactions with Netflix are served from AWS, except the actual streaming of the video. Once you click "play," the actual video files are served from our own CDN.) During the outage, none of the traffic going into US-East was reaching our services.
To prevent this from happening again, we decided to build a system of regional failovers that is resilient to failures of our underlying service providers. Failover is a method of protecting computer systems from failure in which standby equipment automatically takes over when the main system fails.
Regional failovers decreased the risk
We expanded to a total of three AWS regions: two in the United States (US-East and US-West) and one in the European Union (EU). We reserved enough capacity to perform a failover so that we can absorb an outage of a single region.
A typical failover looks like this:
- Realize that one of the regions is having trouble.
- Scale up the two savior regions.
- Proxy some traffic from the troubled region to the saviors.
- Change DNS away from the problem region to the savior regions.
Let's explore each step.
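Before diving in, the four steps above can be sketched as a simple orchestration plan. This is an illustrative outline only; the region names match AWS conventions, but the step names and `failover` function are hypothetical, not Netflix's actual tooling.

```python
# Hypothetical sketch of the failover runbook as an ordered plan.
REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]

def failover(sick_region, regions=REGIONS):
    # Step 1 has already happened: we know which region is sick.
    saviors = [r for r in regions if r != sick_region]
    return [
        ("scale_up", saviors),                   # step 2: prep the saviors
        ("proxy_traffic", sick_region, saviors), # step 3: ramp traffic over
        ("switch_dns", sick_region, saviors),    # step 4: abandon the region
    ]

print(failover("us-east-1"))
```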
1. Identify the trouble
We need metrics, and preferably a single metric, that can tell us the health of the system. At Netflix, we use a business metric called stream starts per second (SPS for short). This is a count of the number of clients that have successfully started streaming a show.
We have this data partitioned per region, and at any given time we can plot the SPS data for each region and compare it against the SPS value from the day before and the week before. When we notice a dip in the SPS graph, we know our customers are not able to start streaming shows, thus we're in trouble.
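A minimal sketch of that comparison, assuming we flag a region when its current SPS falls well below the same moment a day ago and a week ago (the `0.7` threshold and all numbers are made up for illustration):

```python
def sps_dip(current, day_ago, week_ago, threshold=0.7):
    """Return True if current SPS falls below `threshold` times the
    expected value (the mean of the day-ago and week-ago baselines)."""
    expected = (day_ago + week_ago) / 2.0
    return current < threshold * expected

# Example: a region streaming 40k starts/sec against a ~100k baseline.
print(sps_dip(40_000, 100_000, 98_000))  # True -> candidate for failover
print(sps_dip(95_000, 100_000, 98_000))  # False -> looks healthy
```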
The trouble isn't necessarily a cloud infrastructure issue. It could be a bad code deploy in one of the hundreds of microservices that make up the Netflix ecosystem, a cut in an undersea cable, etc. We may not know the reason; we simply know that something is wrong.
If this dip in SPS is observed only in one region, it's a great candidate for regional failover. If the dip is observed in multiple regions, we're out of luck because we only have enough capacity to evacuate one region at a time. This is precisely why we stagger the deployment of our microservices to one region at a time. If there is a problem with a deployment, we can evacuate immediately and debug the issue later. Similarly, we want to avoid failing over when the problem would simply follow the redirected traffic (as would happen in a DDoS attack).
2. Scale up the saviors
Once we have identified the sick region, we should prep the other regions (the "saviors") to receive the traffic from the sicko. Before we turn on the fire hose, we need to scale the stack in the savior regions appropriately.
What does scaling appropriately mean in this context? Netflix's traffic pattern is not static throughout the day. We have peak viewing hours, usually around 6-9pm. But 6pm arrives at different times in different parts of the world. The peak traffic in US-East is three hours ahead of US-West, which is eight hours behind the EU region.
When we failover US-East, we send traffic from the Eastern U.S. to the EU and traffic from South America to US-West. This is to reduce the latency and provide the best possible experience for our customers.
Taking this into consideration, we can use linear regression to predict the traffic that will be routed to the savior regions for that time of day (and day of week) using the historical scaling behavior of each microservice.
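As a rough illustration of that prediction, here is a linear fit of instances needed against request rate using NumPy's `polyfit`. The historical data points are invented, and a real model would also factor in time of day and day of week as described above:

```python
import numpy as np

# Made-up historical scaling data for one microservice:
# requests per second vs. instances that were needed to serve them.
hist_rps       = np.array([10_000, 20_000, 30_000, 40_000])
hist_instances = np.array([   100,    190,    290,    385])

# Fit a degree-1 (linear) model: instances ~= slope * rps + intercept.
slope, intercept = np.polyfit(hist_rps, hist_instances, deg=1)

def predict_instances(expected_rps):
    # Round up: better slightly over-provisioned than under.
    return int(np.ceil(slope * expected_rps + intercept))

# Savior region expects its own 25k rps plus 15k rps of failover traffic.
print(predict_instances(40_000))  # ~385 instances
```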
Once we have determined the appropriate size for each microservice, we trigger scaling for each of them by setting the desired size of each cluster and then let AWS do its magic.
3. Proxy traffic
Now that the microservice clusters have been scaled, we start proxying traffic from the sick region to the savior regions. Netflix has built a high-performance, cross-regional edge proxy called Zuul, which we have open sourced.
These proxy services are designed to authenticate requests, do load shedding, retry failed requests, etc. The Zuul proxy can also do cross-region proxying. We use this feature to route a trickle of traffic away from the suffering region, then progressively increase the amount of rerouted traffic until it reaches 100%.
This progressive proxying allows our services to use their scaling policies to do any reactive scaling necessary to handle the incoming traffic. This is to compensate for any change in traffic volume between the time when we did our scaling predictions and the time it took to scale each cluster.
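The ramp itself can be pictured as a sequence of weight changes with pauses in between so reactive scaling can catch up. This is a sketch only: `set_cross_region_weight` is a hypothetical hook standing in for the actual Zuul configuration change, and the step percentages and pause are invented.

```python
import time

RAMP_STEPS = [1, 5, 10, 25, 50, 75, 100]  # percent of traffic rerouted

def ramp_traffic(set_cross_region_weight, pause_seconds=60):
    """Progressively shift traffic away from the sick region, pausing
    between steps so savior-region scaling policies can react."""
    for pct in RAMP_STEPS:
        set_cross_region_weight(pct)
        time.sleep(pause_seconds)

# Example with a stub that just records the ramp:
applied = []
ramp_traffic(applied.append, pause_seconds=0)
print(applied)  # [1, 5, 10, 25, 50, 75, 100]
```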
Zuul does the heavy lifting at this point to route all incoming traffic from a sick region to the healthy regions. But the time has come to abandon the affected region completely. This is where the DNS switching comes into play.
4. Switch the DNS
The last step in the failover is to update the DNS records that point to the affected region and redirect them to the healthy regions. This will completely move all client traffic away from the sick region. Any clients that don't expire their DNS cache will still be routed by the Zuul layer in the affected region.
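A hedged sketch of what that DNS update might look like as a Route 53-style change batch. The hostnames are made up, and actually applying it would go through boto3's `route53.change_resource_record_sets`, which is deliberately not called here:

```python
def dns_switch_batch(record_name, savior_endpoint, ttl=60):
    """Build a change batch pointing the service record at a savior
    region's endpoint. A short TTL makes clients re-resolve quickly."""
    return {
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "TTL": ttl,
                "ResourceRecords": [{"Value": savior_endpoint}],
            },
        }]
    }

batch = dns_switch_batch("api.example.com", "api.us-west-2.example.com")
print(batch["Changes"][0]["Action"])  # UPSERT
```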
That's how failover used to work at Netflix. This process took a long time to complete—about 45 minutes (on a good day).
Speeding response with shiny, new processes
We noticed that the majority of the time (approximately 35 minutes) was spent waiting for the savior regions to scale. Even though AWS could provision new instances for us in a matter of minutes, starting up the services, doing just-in-time warm-up, and handling other startup tasks before registering UP in discovery dominated the scaling process.
We decided this was too long. We wanted our failovers to complete in under 10 minutes. We wanted to do this without adding operational burden to the service owners. We also wanted to stay cost-neutral.
We reserve capacity in all three regions to absorb the failover traffic; if we're already paying for all that capacity, why not use it? Thus began Project Nimble.
Our idea was to maintain a pool of instances in hot standby for each microservice. When we are ready to do a failover, we can simply inject our hot standby into the clusters to take live traffic.
The unused reserved capacity is called trough. A few teams at Netflix use some of the trough capacity to run batch jobs, so we can't simply turn all the available trough into hot standby. Instead, we can maintain a shadow cluster for each microservice that we run and stock that shadow cluster with just enough instances to take the failover traffic for that time of day. The rest of the instances are available for batch jobs to use as they please.
At the time of failover, instead of the traditional scaling method that triggers AWS to provision instances for us, we inject the instances from the shadow cluster into the live cluster. This process takes about four minutes, as opposed to the 35 minutes it used to take.
Since our capacity injection is swift, we don't have to cautiously move the traffic by proxying to allow scaling policies to react. We can simply switch the DNS and open the floodgates, thus shaving even more precious minutes during an outage.
We added filters in the shadow cluster to prevent the dark instances from reporting metrics. Otherwise, they would pollute the metric space and obscure normal operating behavior.
We also stopped the instances in the shadow clusters from registering themselves UP in discovery by modifying our discovery client. These instances will continue to remain in the dark (pun fully intended) until we trigger a failover.
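The dark-instance idea can be sketched as a small state machine in the discovery client. The class and status strings below are illustrative, loosely modeled on Eureka-style instance states, not Netflix's actual client code:

```python
class DiscoveryClient:
    """Toy model: shadow instances stay OUT_OF_SERVICE (dark) until a
    failover is triggered, then report UP and start taking traffic."""

    def __init__(self, shadow=False):
        self.shadow = shadow
        self.failover_active = False

    @property
    def status(self):
        if self.shadow and not self.failover_active:
            return "OUT_OF_SERVICE"  # dark: no traffic, no registration
        return "UP"

client = DiscoveryClient(shadow=True)
print(client.status)           # OUT_OF_SERVICE
client.failover_active = True  # failover triggered: inject into live cluster
print(client.status)           # UP
```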
Now we can do regional failovers in seven minutes. Since we utilized our existing reserved capacity, we didn't incur any additional infrastructure costs. The software that orchestrates the failover is written in Python by a team of three engineers.