A user's guide to failing faster

5 considerations when encouraging failure

Failure.

Now that's a word with a negative vibe. In engineering and construction, it conjures up the Titanic sinking, the Tacoma Narrows Bridge twisting in the wind, or the space shuttle Challenger exploding. These were all failures of engineering design or management.

Most failures in the pure software realm don't lead to the same visceral imagery as the above, but they can have widespread financial and human costs all the same. Think of the failed Healthcare.gov launch, the Target data breach, or really any number of multi-million dollar projects that basically didn't work in the end. In 2012, the US Air Force scrapped an ERP project after racking up $1 billion in costs.

In cases like these, playing the blame game is customary. Even when most of those involved don't literally go down with the ship—as in the case of the Titanic—people get fired, careers get curtailed, and the Internet has a field day with both the individuals and the organizations.

But how do we square that with the frequent admonition to embrace failure in your DevOps culture? If we should embrace failure, how can we punish it?

Failing well

Not all failure is created equal. The key to success is understanding the different types of failure and structuring the environment and processes to minimize the bad kinds. In short, it's learning to "fail well," as Megan McArdle writes in The Up Side of Down.

In that book, McArdle describes the Marshmallow Challenge, an experiment originally concocted by Peter Skillman, the former VP of design at Palm. In this challenge, groups receive 20 sticks of spaghetti, one yard of tape, one yard of string, and one marshmallow. Their objective is to build a structure that holds the marshmallow off the ground, as high as possible.

Skillman conducted his experiment with all sorts of participants, from business school students to engineers to kindergartners. The business school students did the worst. I'm a former business school student, and this does not surprise me. According to Skillman, they spent too much time arguing about who was going to be the CEO of Spaghetti, Inc. The engineers did well but also did not come out on top. As someone who also has an engineering degree and has participated in similar exercises, I suspect they spent too much time arguing over the optimal structural design approach.

By contrast, the kindergartners didn't sit around talking about the problem. They just started building to determine what works and what doesn't. And they did the best.

Setting up a system and environment that allows and encourages such experiments enables successful failure in agile software development. It doesn't mean that no one is accountable for failures. In fact, it makes accountability easier because "being accountable" needn't equate to "having caused some disaster." In this respect, it changes the nature of accountability.

Designing for accountability

We should consider five principles when we think about such a system: scope, approach, workflow, incentives, and culture.

Scope

The right scope is about constraining the impact of failure and stopping additional failures from cascading. This is central to encouraging experimentation because it minimizes the effect of any single failure. (And, if you don't have failures, you're not experimenting.) In general, you want to decouple activities and decisions from each other. From a DevOps perspective, this means making deployments incremental, frequent, and routine events—in part by deploying small, autonomous, bounded-context services (i.e., microservices or similar patterns).
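
To make "stopping the cascade" concrete, here is a minimal sketch of a circuit-breaker-style guard in Python. It isn't from the article; the CircuitBreaker class and the call_payment_service function it wraps are hypothetical, and a real deployment would more likely rely on a service mesh or an established library. The point is simply that a failing dependency gets isolated quickly instead of dragging every caller down with it.

```python
import time

class CircuitBreaker:
    """Stop calling a failing dependency so one failure doesn't cascade."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures  # failures tolerated before opening
        self.reset_after = reset_after    # seconds to wait before retrying
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        # While the breaker is open, fail fast until the cool-down expires.
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: dependency unavailable")
            self.opened_at = None  # cool-down over; give it another chance
            self.failures = 0

        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()  # open the circuit
            raise
        self.failures = 0  # a success resets the count
        return result

# Hypothetical usage: wrap calls to a flaky downstream service.
# breaker = CircuitBreaker()
# order = breaker.call(call_payment_service, order_id=42)
```

Because the breaker fails fast while the dependency is unhealthy, an experiment that goes wrong stays a local, recoverable event rather than a system-wide outage.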

Approach

The right approach is about continuously experimenting, iterating, and improving. This is the philosophy that DevOps and agile development inherit from the Toyota Production System's kaizen (continuous improvement) and other manufacturing antecedents. The most effective processes have continuous communication—think scrums and kanban—and allow for collaboration that can identify failures before they happen. At the same time, when failures do occur, the process provides feedback that drives continuous improvement and ongoing learning.

Workflow

The right workflow automates routine steps for consistency, which reduces the number of failures attributable to inevitable careless mistakes, such as a mistyped command. This allows for a greater focus on design errors and other systematic causes of failure. In DevOps, much of this takes the form of a continuous integration/continuous delivery (CI/CD) workflow that uses monitoring, feedback loops, and automated test suites to catch failures as early in the process as possible.
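
As an illustration (not from the article), here is a minimal sketch of the kind of automated gate a CI/CD pipeline might run on every commit. The deploy.json file and its required keys are hypothetical; the idea is that a typo gets caught by a machine before it ever reaches production.

```python
# Minimal sketch of an automated pre-deployment check a CI/CD pipeline
# might run on every commit. The deploy.json file name and the required
# keys below are hypothetical, chosen only to illustrate the pattern.
import json
import sys

REQUIRED_KEYS = {"service_name", "image_tag", "replicas"}

def validate_deploy_config(path="deploy.json"):
    with open(path) as f:
        config = json.load(f)  # malformed JSON fails right here, not in prod
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f"deploy config missing keys: {sorted(missing)}")
    if not isinstance(config["replicas"], int) or config["replicas"] < 1:
        raise ValueError("replicas must be a positive integer")
    return config

if __name__ == "__main__":
    try:
        validate_deploy_config(sys.argv[1] if len(sys.argv) > 1 else "deploy.json")
    except (OSError, ValueError) as err:
        print(f"pre-deployment check failed: {err}")
        sys.exit(1)
    print("deploy config looks sane")
```

A pipeline that fails this check simply refuses to deploy, so the cost of the mistake is a red build rather than an outage.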

Incentives

The right incentives align rewards and behavior with desirable outcomes. Incentives (such as advancement, money, and recognition) need to reward trust, cooperation, and innovation. The key is that individuals have control over their own success. This is probably a good place to point out that failure is not always a positive outcome: when it results from repeatedly ignoring established processes and design rules, actions still have consequences.

Culture

The right culture is, at least in part, about building organizations and systems that allow for failing well—and thereby make accountability within that framework a positive attribute rather than part of a blame game. This requires transparency. It also requires an understanding that even good decisions can have bad outcomes. A technology doesn't develop as expected. The market shifts. An architectural approach turns out not to scale. Stuff happens. Innovation is inherently risky. Cut your losses and move on, avoiding the sunk cost fallacy.

Properly dealing with accountability and failure in agile IT does require appropriate architectures, tools, and processes to be in place. Low-impact experimentation on a fragile monolithic application will be difficult, and it will be hard to avoid costly failures and subsequent blame. However, the culture of an organization still plays an outsized role. Legendary management consultant Peter Drucker is often credited with saying that "culture eats strategy for breakfast." Culture has a similar appetite for many aspects of the software development process.

This article is part of The Open Organization Guide to IT culture change.

Gordon Haff is a Red Hat technology evangelist and a frequent and highly acclaimed speaker at customer and industry events, focused on areas including Red Hat Research, open source adoption, and emerging technologies.

5 Comments

The catchy idea of failing quickly has, I think, run its course, since by itself it doesn't really say much and is a bit off the mark. Maybe articles like this will help to flesh it out.
The most important thing about failure is to have enough structure and information that you know how and why you failed, and that this information feeds back in a way that is useful for the next round. In a similar way, even successes are less valuable when you don't know why something succeeded. Was it just a random lottery?

I don't disagree with any of that. In fact, I don't really talk much about the speed of failure in my post. I would argue, though, that speed of iteration is part of the equation. It's fine to understand why some lengthy, expensive project failed after it's put into production, but you may not have a "next time" in which to apply any lessons.

In reply to Greg P

Now I'm confused. You say you don't say much about speed, but the title is about failing faster.
I never mentioned failing in the production environment, though to believe that something can't be allowed to fail there is Pollyanna thinking.

In reply to ghaff

In Gordon's defense: He didn't write the headline. I did. Were I to do it over again, I'd likely opt for "A user's guide to failing gracefully" or "A user's guide to failing effectively." I think I just became too enamored with his excellent anecdote of the kindergartners throwing caution to the wind and rushing headlong into playful experimentation.

In reply to Greg P

To add to Bryan's point, I don't say a lot about speed but it is implicit in a lot of other things. It's difficult both financially and culturally if those failures require writing off investments that have consumed a lot of money and time.

It's definitely important to be able to tolerate and mitigate small-scale production failures. Hence canary deployments, etc. (and Netflix's Chaos Monkey).

In reply to Greg P

This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.
