Don't test in production? Test in production!

Yes, testing in production is risky, but we should still do it, and not in rare or exceptional cases.

If you last updated your IT security standards five or more years ago, chances are they don't line up well with the realities of today's DevOps and site reliability engineering (SRE) practices. One particularly sticky topic is testing in production—and, thus, testing with production data—because DevOps and SRE blur the line between what is production and what is not; what is a test and what is not.

To clear up some of the confusion, we'll dig into these questions:

  • Why do we separate dev/test and production systems?
  • What should we manage according to the high standards of a production system?
  • Why is testing on production systems so risky?
  • Why should we test in production?
  • What about production data?
  • How can we make testing in production less risky?

I should note that this is an opinion piece; it's based on years of collective DevOps and testing experience, but it's not to be read as an official IBM statement.

Why do we separate dev/test and production systems?

It's standard practice to treat development, test, and production systems differently, at least from a compliance and risk-management standpoint, mostly because they have differing security, data, and privacy controls. Let's take a step back and think about the historical reasons for the different attitudes toward these deployment environments.

Our production systems are the most important because they run our businesses and governments. These systems serve our customers and have a direct impact on customer satisfaction. It's normal for a developer's working environment to be "broken" for a few hours now and then, but we must manage production systems according to impeccable standards of quality, reliability, and availability. That's why it's crucial to limit risks to our production systems. DevOps and SRE still focus on risk avoidance, but they use different risk-reduction strategies than other practices (such as ITIL).

In addition, production systems are special because they have access to production data. Production data must be a reliable source of truth, so we must protect it from corruption. Production data is also likely to contain information that we can share only with authorized users, such as confidential or personal data, so we must ensure that it's protected by production-level authentication and authorization. Finally, we may need to maintain an audit trail of accesses to production data (create, read, update, and delete), which isn't needed for dev/test systems.

Production systems are more tightly controlled and monitored, for good reasons.

We also need excellent visibility into and control over the current state of our production systems. We monitor them carefully so we can quickly detect problems, and when problems occur, knowing the current configuration of those systems makes it easier to recover quickly. Most people couldn't care less whether a developer changes the configuration settings on their personal laptop, but we lock down production systems to a known configuration and with secure change controls in place. Whether we lock down the configuration via a change-control database or infrastructure-as-code, the goal is the same: visibility and control.
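
To make "visibility and control" concrete, here's a minimal sketch of a drift check, assuming a hypothetical JSON export of a server's live settings; real infrastructure-as-code tooling does this far more robustly, but the idea is the same:

```python
# A minimal sketch of configuration drift detection; the file names and
# flat JSON structure are assumptions for illustration only.
import json

def load_config(path: str) -> dict:
    with open(path) as f:
        return json.load(f)

def find_drift(declared: dict, live: dict) -> dict:
    """Return every setting whose live value differs from the declared value."""
    return {
        key: {"declared": declared.get(key), "live": live.get(key)}
        for key in declared.keys() | live.keys()
        if declared.get(key) != live.get(key)
    }

# declared.json is the version-controlled source of truth; live.json is a
# snapshot exported from the running system.
drift = find_drift(load_config("declared.json"), load_config("live.json"))
if drift:
    print("Configuration drift detected:", drift)
```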

Finally, remember that we manage dev/test and production systems differently because there are compliance rules and regulations specifically for production systems. Few things kill velocity quicker than unnecessary burdens on our dev/test environments!

What should we manage according to the high standards of a production system?

When we started thinking about testing in production, we quickly realized that we were beginning with an assumption: It should be easy to determine what is and what is not a production environment. But, as happens with most assumptions, we were wrong. Developers and testers want to move fast; when in doubt, we tend to classify systems as dev/test instead of production so that we don't have to deal with the production system management overhead. But how do we know when we need to put production controls in place? It's not all black-and-white, but there are several considerations.

A few examples are clear-cut: We can agree that developer laptops and environments designed specifically for testing (e.g., integration test, system test, performance test) are not production systems. Likewise, there's general consensus that systems serving real customers with real data, directly or behind the scenes, are production systems. Internal-only systems that are critical to the company's operations also count as production systems.

Modern software development and delivery practices can blur the line between development, test, and production systems.

Often, though, the line between "production" and "non-production" depends on your unique situation and on your use of these terms:

  • Staging
  • Pre-production
  • Pre-live
  • Preview

Your staging environment, for example, might be one you only run tests against, in which case it's more of a test system. On the other hand, it might be what your business partners use to try out new APIs before you release them; in that case, you should manage it like a production system for most intents and purposes, because you expect it to simulate the real user experience of those APIs. You can perhaps tolerate a bit more downtime for that type of server, but you should use production-quality authentication and authorization, put controls on your server configurations, and monitor the servers as you would a production system.

A content management system's preview environment is another example of a system that sounds like one type but is really the other. Preview content is not published yet. Maybe it's time-sensitive too, such as a website for an unannounced product. Someone will publish the new product's web pages for all the world to see after the announcement; but before publication, they are highly confidential. Therefore, the preview environment must have even more authentication and authorization controls than the production environment. It must not render a preview page unless the current user has the right to see it.
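
As an illustration of that last requirement, here's a minimal, fail-closed authorization check in Python; the user model, the "preview:<slug>" permission scheme, and the page structure are hypothetical, not any real CMS's API:

```python
# A minimal sketch of a fail-closed preview check. The User and Page
# models and the permission naming scheme are hypothetical.
from dataclasses import dataclass, field

@dataclass
class User:
    name: str
    permissions: set = field(default_factory=set)

@dataclass
class Page:
    slug: str
    published: bool

def can_view(user: User, page: Page) -> bool:
    """Published pages are public; unpublished previews require an
    explicit per-page permission, checked on every request."""
    if page.published:
        return True
    return f"preview:{page.slug}" in user.permissions

def render(user: User, page: Page) -> str:
    # Fail closed: deny by default rather than leak confidential content.
    if not can_view(user, page):
        raise PermissionError(f"Not authorized to preview {page.slug}")
    return f"<h1>{page.slug}</h1>"

launch_page = Page(slug="unannounced-product", published=False)
editor = User(name="editor", permissions={"preview:unannounced-product"})
print(render(editor, launch_page))  # allowed; any other user is refused
```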

We should treat these like production systems too:

  • Blue/green deployments. Why? The backup environment that is not getting traffic could become the production environment at any moment.
  • Backup servers in a high-availability configuration. Why? The backup servers could start serving production traffic at any time.
  • Canary deployments. Why? These serve a small portion of production traffic.
  • Staged rollouts. Why? All versions of the hardware and software that are serving production traffic are "in production."
  • A/B testing servers. Why? Even though "testing" is in the name, these serve production traffic too.

It's important to be consistent when you apply rules and heuristics to your systems and environments. You shouldn't consider a staging environment a production system one day and a test system the next. That's a recipe for disaster. Make sure everyone understands which of the systems are production systems and which are not, then document your team's decisions and any exceptions.

Putting effort into understanding which systems are production systems and which are not, and treating them appropriately, will ensure that you protect your production systems without hurting your development and test velocity.

Why is testing on production systems so risky?

When people say, "Don't test in production," it's because they want to avoid several possible (bad) outcomes:

  • Corrupted or invalid data
  • Leaked protected data
  • Incorrect revenue recognition (canceled orders, etc.)
  • Overloaded systems
  • Unintended side effects or impacts on other production systems
  • High error rates that set off alerts and page people on call
  • Skewed analytics (traffic funnels, A/B test results, etc.)
  • Inaccurate traffic logs full of script and bot activity
  • Non-compliance with standards

Why should we go ahead and test in production anyway?

Yes, testing in production is risky, but we should still do it, and not in rare or exceptional cases, either. These tests-in-production are accepted as best practices in the DevOps and SRE communities:

  • A/B testing and experiments
  • Usability testing and UX research
  • Final smoke testing of blue/green deployments
  • Feature flags
  • Staged roll-outs
  • Canary testing
  • Health checks and other production system monitoring, including scripted health tests (a minimal sketch appears after this list)
  • Visual regression testing of web pages to compare staged vs. production versions
  • Accessibility regression testing (after initial testing and deployment)
  • Scripts that scan web pages for broken links and report errors
  • Real-user monitoring
  • Chaos engineering
  • Failover testing
  • Other testing of high-availability/disaster recovery plans
  • Bug bounty programs
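
To pick one item from the list and make it concrete: a scripted health test can be a very small program run on a schedule. This is a minimal sketch; the /health endpoint, latency budget, and exit-code convention are assumptions, not a prescription:

```python
# A minimal scripted health test; the endpoint URL and thresholds are
# hypothetical and should match your own service's health contract.
import sys
import time
import urllib.request

HEALTH_URL = "https://example.com/health"  # placeholder endpoint
TIMEOUT_SECONDS = 5
MAX_LATENCY_SECONDS = 2.0

def healthy(url: str) -> bool:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT_SECONDS) as resp:
            elapsed = time.monotonic() - start
            # Fail on bad status codes and on responses that are too slow.
            return resp.status == 200 and elapsed < MAX_LATENCY_SECONDS
    except OSError:
        return False

if __name__ == "__main__":
    # A non-zero exit code lets scheduling/monitoring tooling raise an alert.
    sys.exit(0 if healthy(HEALTH_URL) else 1)
```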

Production tests help us:

  • Prevent bad deployments from breaking production systems
  • Objectively identify which user experiences are more effective
  • Design more delightful user/site interactions
  • Gradually roll out new features (a sketch follows this list)
  • Get quick feedback on success or failure of our latest changes
  • Catch problems before users notice them
  • Understand web page performance characteristics and change impact
  • Build more resilient systems
  • Improve system quality
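
As one example of the gradual roll-out mentioned above, a staged roll-out gate can be as simple as deterministic percentage bucketing. This is a sketch under assumed names; real feature-flag services add targeting rules, kill switches, and audit trails:

```python
# A minimal sketch of a staged roll-out gate: hash user + feature to a
# stable bucket so a fixed percentage of users sees the new code path.
# The feature name and percentage are hypothetical.
import hashlib

def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Deterministically map (feature, user) to a bucket in [0, 100)."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:2], "big") % 100 < percent

# Serve the new checkout flow to 5% of users; ramp up as confidence grows.
for uid in ("user-1", "user-2", "user-3"):
    print(uid, in_rollout(uid, "new-checkout", percent=5))
```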

By running several types of production tests, either at deployment time or on a frequent schedule, we can cover a variety of critical non-functional requirements:

[Table: Goals of various types of production tests. Columns (goals): user experiences, availability, roll-out changes, feedback, quality, performance, resilience. Rows (test types): A/B testing, usability/UX, smoke testing, feature flags, staged roll-outs, canary testing, health checks, regression testing, broken-link checkers, real-user monitoring, chaos engineering, failover testing, HA/DR testing, bug bounty, and penetration testing.]

…and one outlier (because isn't there always a problem child?): Third-party penetration ("pen") testing. Should we do it on production systems? On the one hand, it's undeniably risky; for example, if pen testers find an injection vulnerability, you could end up with corrupted data in your database. On the other hand, hackers are probably running pen testing suites on your internet-facing systems every week. Therefore, pen testing is happening on many of your production systems whether you approve of it or not. That's why I've included it in this list of production tests. I have two recommendations:

  • Make sure that your pen testers-for-hire are working with a production-like environment and not a toy.
  • Run the most popular security test suites against your test systems and fix any errors you find before everyone else runs the same tests against your production systems.
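
This is nothing like a substitute for a real security test suite, but as a minimal sketch of that second recommendation (run automated checks against your test systems first), here's a response-header audit against a placeholder URL:

```python
# A minimal sketch: verify common security response headers on a test
# system before the same scans hit production. The URL is a placeholder,
# and a real suite (e.g., OWASP ZAP) checks far more than headers.
import urllib.request

EXPECTED_HEADERS = (
    "Strict-Transport-Security",
    "X-Content-Type-Options",
    "Content-Security-Policy",
)

def missing_security_headers(url: str) -> list:
    with urllib.request.urlopen(url, timeout=5) as resp:
        return [h for h in EXPECTED_HEADERS if resp.headers.get(h) is None]

print(missing_security_headers("https://test.example.com"))
```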

Your production systems need to be resistant to hacking attempts and handle them gracefully.

Finally, did you notice that these tests-in-production have something in common? None of them makes "test copies" of production data. They all operate directly on real production systems and data.

What about production data?

Here's a shortcut: Dev/test environments may not need special test data. They can often use production data that's available to anyone, like actual web page content, as long as your tests won't modify it; that saves you the time and expense of creating test data. All of the production tests in the previous section fall into this category, as do many web services and APIs.

But beware! Just because data is available on the internet or a REST API is free to use doesn't mean you can use it for your dev/test purposes. Make sure you understand and comply with any applicable license agreements and website usage agreements before you take and use open data.

It's great if you can save time and money by using production data, but some of your applications and services need to modify your data store, so you'll need tests that modify data as well. Running these tests in production is difficult to do without corrupting your production systems. Faced with this reality, when it becomes necessary to modify the data store in order to validate a test scenario, most development teams choose to have different data sources: one for dev/test and one for production. But how do you set up test data that's realistic and complete enough for a good test?

If your production database is small enough, you could copy it and test against the copy, but copying production data into a dev/test environment is problematic because it can bypass security and privacy controls. (GDPR, anyone?)

Let's take an example. You've put into place your carefully thought-out security and privacy controls. You set up your production systems so that only people with a "need to know" can access any personal data; you know where your data is stored; you've established a process for removing personal data from your systems on demand; and so on. Maybe you have customer addresses and phone numbers in your database. If someone copies the database to a dev/test system, and you didn't implement your security and privacy controls there, you have a hole in your system. If a customer exercises their "right to erasure" and requests that you delete their address and phone number, how will you know which dev/test systems you need to update to remove that information? What if a developer's laptop or a test mobile device with personal data on it is stolen? Will you have to report and mitigate a security breach? To close these holes, you need to either keep personal data off of your dev/test systems or include dev/test systems with access to personal data in your production data compliance scope.

Locking your laptop and phone with a chain isn't the best plan. Assume these devices will be stolen eventually and decide how to manage them accordingly.

The obvious alternative is to use mock data for dev/test environments, but deciding when to use mock data for testing is difficult because creating and maintaining mock data is time-consuming and error-prone. If you start with a production database and manually scrub it to obscure or remove sensitive data, you might miss something. If you go too far with a scrub, you might corrupt the data or limit its testing usefulness. Conversely, if you build up a test database from scratch, it's difficult to create all of the permutations and edge cases you need to include for a good test suite.

Only you and your teammates can decide which direction is the right one to take for your unique purposes, but here are some considerations that will help:

  • Understand what controls are on your production data before you use it and respect those controls.
  • Make sure that your entire team agrees on the meaning of sensitive data and personal data.
  • Define and get acceptance of your protocols for handling sensitive and personal data.
  • Document and understand the sensitive and personal data in your systems. Make sure that your developers and testers understand exactly what sensitive and personal data they use.
  • If you decide to sanitize production data to remove personal or sensitive information, ensure that the process for sanitizing the data isn't itself a security/privacy/compliance hole. Consider third-party scrubbing and sanitizing software, because it's less likely to cause errors than a homegrown solution.
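
To illustrate what sanitizing involves, and why it deserves the same scrutiny as production code, here's a minimal sketch that pseudonymizes assumed field names. Note that hashed values can still count as personal data under regulations like GDPR, so treat this as a starting point, not a compliance guarantee:

```python
# A minimal sketch of scrubbing records before they reach dev/test.
# Field names are hypothetical; hashing preserves joins and uniqueness
# but is pseudonymization, not full anonymization.
import hashlib

SENSITIVE_FIELDS = {"email", "phone", "address"}

def pseudonymize(value: str) -> str:
    """Replace a value with a stable, non-reversible token."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:16]

def scrub_record(record: dict) -> dict:
    return {
        key: pseudonymize(str(val)) if key in SENSITIVE_FIELDS else val
        for key, val in record.items()
    }

print(scrub_record({"id": 42, "email": "jane@example.com", "plan": "pro"}))
```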

How can we make testing in production less risky?

The intent of the adage, "Don't test in production" is to protect our production systems. Now that we've established that we can and should test in production every day, how do we keep our production systems safe?

Have a testing plan and complete it before putting a new component into production. Well-tested code is less likely to fail in production.

First and foremost, test all systems thoroughly with automated tests before you get to production. I'm a firm believer in 100% automated unit test coverage, with unit tests in the same change set or pull request as the code changes they validate. You should complete multiple layers of testing before going to production: functional/behavioral testing, integration testing, deployment/configuration testing, accessibility testing, security testing, and high-availability failover testing. If you're doing gradual roll-outs of new features, test the roll-out process too. And yes, I believe in manual testing as well, but never as a replacement for test automation!
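
For instance, a unit test that ships in the same pull request as the change it validates can be as small as this; the function and expected behavior are invented for illustration:

```python
# A minimal sketch of a unit test committed alongside the code it
# validates. Both the function and the expectation are hypothetical.
def normalize_slug(title: str) -> str:
    """Turn a page title into a URL-safe slug."""
    return "-".join(title.lower().split())

def test_normalize_slug():
    assert normalize_slug("Testing in Production") == "testing-in-production"

test_normalize_slug()  # pytest would discover the test_ function automatically
```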

What about free-form testing in production? Two words: Be careful. "Bug hunt" days, for example, are a best practice where you ask everyone on your team to spend a set amount of time trying to find bugs in your software. These are fun and productive… as long as you set up the proper guardrails. Review the risks of testing in production with your team, and teach your bug hunters what not to do, such as placing orders with their personal credit cards and immediately canceling them; these kinds of tests can interfere with revenue recognition and order statistics.

Also, don't reconfigure your production systems manually, whether for testing or for any other reason. Manual changes leave your systems in an unknown state, and it can be very difficult to get them back to a known configuration. Manage your production systems' configuration with infrastructure-as-code (Chef, Helm, etc.) and/or release management and orchestration tooling, such as IBM UrbanCode Deploy or Kubernetes.

Before any chaos engineering, make sure you've planned your experiments in keeping with the basic principles of chaos engineering:

  1. Plan an experiment
  2. Contain the blast radius
  3. Scale or squash

You can also reduce the risk of chaos engineering by meeting these prerequisites: solid automated test coverage, good monitoring and alerting, a high-availability setup with fast automated failover, and a team that's on-call and ready to restore service within your service-level objectives (SLOs) if something fails. Within my team, we normally run the first fail-over tests for each component on a weekend with the developer on call present, and teams "graduate" to testing for resilience during normal working hours after they pass the first set of tests.
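
Here's a minimal sketch of what "contain the blast radius" and "scale or squash" can look like in code; the service name, instance lookup, fault injection, and error-rate source are all placeholders for your orchestrator and monitoring stack:

```python
# A minimal sketch of a contained chaos experiment. Everything marked
# "placeholder" would come from your orchestrator and monitoring stack.
import random

def instances(service: str) -> list:
    return [f"{service}-{i}" for i in range(10)]  # placeholder inventory

def kill(instance: str) -> None:
    print(f"terminating {instance}")  # placeholder fault injection

def error_rate(service: str) -> float:
    return 0.001  # placeholder: read from your monitoring system

def run_experiment(service: str, blast_radius: float, slo_error_rate: float) -> bool:
    """Hypothesis: killing a small fraction of instances keeps the error
    rate within SLO. Squash the experiment the moment it doesn't."""
    targets = instances(service)
    sample = random.sample(targets, max(1, int(len(targets) * blast_radius)))
    for instance in sample:
        kill(instance)
        if error_rate(service) > slo_error_rate:
            print("SLO breached; squashing the experiment")
            return False
    print("Hypothesis held; consider scaling the blast radius next run")
    return True

run_experiment("checkout", blast_radius=0.1, slo_error_rate=0.01)
```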

Don't let chaos engineering, scalability, or performance testing cause a cascade of problems in your system. Loosely coupled components, good error handling, and planned failure modes help to isolate failures.

When planning scalability and performance testing, make sure that you won't impact your customers. Don't throw a bunch of API requests at your production system and hope for the best. Use a separate, isolated environment if the cost-benefit analysis justifies it. If you need to test scalability or performance in production, ramp up traffic gradually while monitoring your systems, and stop before service disruption or failure. And don't forget to filter out scalability/performance test traffic from your production analytics!
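
A gradual ramp with an automatic stop can be sketched in a few lines. The endpoint, step sizes, and error threshold below are assumptions, and the X-Load-Test header stands in for however you tag synthetic traffic so it can be filtered out of analytics:

```python
# A minimal sketch of a gradual load ramp that stops before disruption.
# Endpoint, rates, and thresholds are hypothetical.
import time
import urllib.request

TARGET = "https://example.com/api/ping"  # placeholder endpoint
MAX_ERROR_RATE = 0.02

def send_request() -> bool:
    # Tag synthetic traffic so it can be filtered out of analytics.
    req = urllib.request.Request(TARGET, headers={"X-Load-Test": "true"})
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

for rate in (1, 2, 5, 10, 20):  # requests per second, ramped step by step
    attempts = rate * 10  # hold each step for roughly ten seconds
    failures = 0
    for _ in range(attempts):
        if not send_request():
            failures += 1
        time.sleep(1 / rate)
    if failures / attempts > MAX_ERROR_RATE:
        print(f"Stopping ramp at {rate} rps: error rate exceeded threshold")
        break
    print(f"{rate} rps OK")
```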

These risk-reduction techniques will help you keep your production systems resilient and less likely to fail due to testing in production.

Conclusion

Testing in production is extremely valuable and a best practice in modern software engineering, IT operations, and IT security. Production tests help us:

  • Prevent bad deployments from breaking production systems
  • Objectively identify which user experiences are more effective
  • Design more delightful user/site interactions
  • Gradually roll out new features
  • Get quick feedback on success or failure of our latest changes
  • Catch problems before users notice them
  • Understand web page performance characteristics and change impact
  • Build more resilient systems
  • Improve system quality

Therefore, we should not avoid testing in production; rather, we should understand the inherent risks and build safeguards into our systems to address them. We should also update our security and compliance standards to take modern production testing practices into account.

Thank you to the attendees of the "test in production" open space at Devopsdays Charlotte, who collaborated to brainstorm and distill what we really mean by the terms "test" and "production" and what we really need to do to protect our production systems. I would also like to thank Craig Cook and Jocelyn Sese for their helpful feedback on early drafts of this article.
