4 questions open source engineers should ask to mitigate risk at scale

What do you do with a finite amount of time to deal with an infinite number of things that can go wrong? 


At Shopify, we use and maintain a lot of open source projects, and every year we prepare for Black Friday Cyber Monday (BFCM) and other high-traffic events to make sure our merchants can sell to their buyers. To do this, we built a large-scale infrastructure platform that is highly complex, interconnected, and globally distributed, and that requires thoughtful technology investments from a network of teams. We’re changing how the internet works, and at our scale no single person can oversee the full design and every detail.

Over BFCM 2022, we served 75.98M requests per minute to our commerce platform at peak. That’s 1.27M requests per second. At this scale, in a complex and interdependent system, it is impossible to identify and mitigate every possible risk. This article breaks down a high-level risk mitigation process into four questions that can be applied to nearly any scenario to help you make the best use of the time and resources you have available.

1. What are the risks?

To inform mitigation decisions, you must first understand the current state of affairs. We expand our breadth of knowledge by learning from people from all corners of the platform. We run “what could go wrong” (WCGW) exercises where anyone building or interested in infrastructure can highlight a risk. These can be technology risks, operational risks, or something else. Having this unfiltered list is a great way to get a broad understanding of what could happen.

The goal here is visibility.

2. What is worth mitigating?

Great brainstorming leaves us with a large and daunting list of risks. With limited time to fix things, the key is to prioritize what is most important to our business. To do this, we vote on risks, then gather technical experts to discuss the highest-ranked risks in more detail, including their likelihood and severity. We decide what to mitigate, how to mitigate it, and which team will own each action item.

The goal here is to optimize how we spend our time.
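As a rough illustration of that prioritization step, the sketch below shortlists risks by vote count and then ranks the shortlist by likelihood times severity. It is not Shopify's actual process or tooling: the `Risk` fields, the 1-to-5 severity scale, the scoring formula, and the sample risks are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    """A hypothetical record for one risk raised in a WCGW exercise."""
    name: str
    votes: int         # votes received during the exercise
    likelihood: float  # experts' estimated probability, between 0 and 1
    severity: int      # assumed scale: 1 (minor) to 5 (critical)

    @property
    def expected_impact(self) -> float:
        # Assumed scoring rule: likelihood multiplied by severity.
        return self.likelihood * self.severity


def shortlist(risks: list[Risk], top_n: int) -> list[Risk]:
    """Keep the top_n most-voted risks for detailed expert discussion."""
    return sorted(risks, key=lambda r: r.votes, reverse=True)[:top_n]


def prioritize(risks: list[Risk]) -> list[Risk]:
    """Order risks from highest to lowest expected impact."""
    return sorted(risks, key=lambda r: r.expected_impact, reverse=True)


# Made-up sample data, for illustration only.
risks = [
    Risk("cache cluster failover", votes=12, likelihood=0.25, severity=4),
    Risk("expired TLS certificate", votes=9, likelihood=0.125, severity=5),
    Risk("noisy dashboard alert", votes=3, likelihood=0.5, severity=1),
    Risk("regional network partition", votes=10, likelihood=0.5, severity=5),
]

ranked = prioritize(shortlist(risks, top_n=3))
for risk in ranked:
    print(f"{risk.name}: expected impact {risk.expected_impact}")
```

Voting filters the list down to what the group cares about, and the likelihood and severity estimates then order what is left. In practice the expert discussion matters more than any formula; a score like this is only a starting point for that conversation.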

3. Who makes what decisions?

In any organization, there are times when waiting for a perfect consensus is not possible or not effective. Shopify moves tremendously fast because we make sure to identify decision makers, then empower them to gather input, weigh risks and rewards, and come to a decision. Often the decision is best made by the subject matter expert or by whoever stands to gain or lose the most from the direction we choose.

The goal here is to align incentives and accountability.

4. How do you communicate?

We move fast but still need to keep stakeholders and close collaborators informed. We summarize key findings and risks from our WCGW exercises so that we all land on the same page about our risk profile. This may include key risks or single points of failure. We over-communicate so that we’re aligned and aware, and so that stakeholders have opportunities to interject.

The goal here is alignment and awareness.

Solving the right things when there is uncertainty

Underlying all these questions is the uncertainty in our working environment. You never have all the facts, and you never know exactly which components will fail, when, or how. The best way to deal with uncertainty is to think in probabilities.

Expert poker players know that great bets don’t always yield great outcomes, and bad bets don’t always yield bad outcomes. What’s important is to bet on the probability of outcomes, because over enough rounds, your results will converge to expectation. The same applies in engineering, where we constantly make bets and learn from them. Great bets require clearly distinguishing the quality of your decisions from the quality of your outcomes. It means not over-indexing on bad decisions that led to lucky outcomes or on great decisions that happened to run into very unlucky scenarios.
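To make the "converge to expectation" idea concrete, here is a minimal simulation sketch. The bet itself is an assumption chosen for illustration: a 60% chance of winning one unit and a 40% chance of losing one unit, which has an expected value of 0.2 units per round. Any single round can still lose, but the average over many rounds settles near the expectation.

```python
import random


def expected_value(p_win: float, gain: float, loss: float) -> float:
    """Expected result of one round: win `gain` with p_win, else lose `loss`."""
    return p_win * gain - (1 - p_win) * loss


def average_outcome(rounds: int, p_win: float = 0.6, gain: float = 1.0,
                    loss: float = 1.0, seed: int = 0) -> float:
    """Simulate `rounds` independent bets and return the mean result per round."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    total = 0.0
    for _ in range(rounds):
        total += gain if rng.random() < p_win else -loss
    return total / rounds


print("expected value per round:", expected_value(0.6, 1.0, 1.0))
for rounds in (10, 1_000, 100_000):
    print(rounds, "rounds, average outcome:", average_outcome(rounds))
```

With only a handful of rounds the average can land far from 0.2, which is why a single good or bad outcome says little about the quality of the decision behind it.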

Knowing that we can’t control everything also helps us stay calm, which is vital for us to practice good judgment in high-pressure situations.

When it comes to BFCM (and life in general), no one can predict the future or fully protect against all risks. The question is, what would you change looking back? In hindsight, would you feel confident that you prioritized the most important things and made thoughtful bets using the information available? Did you facilitate meaningful discussions with the right people? Could you justify your actions to your customers and their customers?


This article originally appeared on Planning in Bets: Risk Mitigation at Scale and is republished with permission.

Kathryn Tang
Kathryn manages the business of Infrastructure at Shopify, leading the Engineering Operations group.


This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.