"We can no longer design state-dependent security in a stateless world." —Rinehart
"We form the hypothesis, but we never test it." —Bergstrom
Over the last few years, DevOps, chaos engineering, and site reliability engineering (SRE) have made foundational shifts in the engineering community worldwide. The discipline of security engineering has gradually made its way into DevOps with DevSecOps, rugged DevOps, and other name variants. Our purpose in this article, which focuses on applying chaos engineering and SRE to the field of cybersecurity, is to share the insights we've gathered in our journey and to challenge the community to think differently about how security systems are designed.
The need to think differently about information security is paramount, as distributed computing threatens to have devastating effects on security. The methods, capabilities, and instrumentation many organizations use today lack the security traits and objective measurements needed to keep pace in a modern engineering ecosystem. Unless controls and monitors are built into all layers of an application from the beginning, the application will inevitably fail to maintain security.
Like a fence around a building, basic security controls serve only as a deterrent: Just as the fence can be easy to break through, credentials can be easy to obtain. Once an intruder determines how to gain access, they potentially have free rein over the entire building. To prevent this, modern buildings feature security doors, limited-access elevators, and built-in camera systems. The most secure floors also include special lobbies and even armed guards. System security should be designed with a similar approach: limiting access to API paths, allowing easy blockage of requests, whitelisting authorized actions and actors, and encrypting all traffic and storage. Like unauthorized visitors attempting to access a building, system requests that don't belong should be flagged and stopped.
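The "security door" idea above can be sketched as a minimal request gate: every call must match a whitelist of (actor, action, path) combinations, and anything that doesn't belong is flagged and stopped. All names here are hypothetical illustrations, not a specific framework's API.

```python
# Hypothetical whitelist of (actor, HTTP method, API path) combinations.
ALLOWED = {
    ("billing-service", "GET", "/invoices"),
    ("billing-service", "POST", "/invoices"),
    ("report-service", "GET", "/invoices"),
}

flagged = []  # requests that "don't belong," kept for review and alerting


def authorize(actor: str, method: str, path: str) -> bool:
    """Allow only whitelisted combinations; flag and stop everything else."""
    if (actor, method, path) in ALLOWED:
        return True
    flagged.append((actor, method, path))  # flag the request...
    return False                           # ...then stop it
```

In a real system this check would sit in middleware at every API path, so that a stolen credential for one service does not grant "free rein" over the rest of the building.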
Feedback loops in distributed systems and security
Even as modern software becomes increasingly distributed, rapidly iterative, and predominantly stateless, today's approach to security remains predominantly preventative, focused on and dependent upon a fixed state in time. It lacks the rapid, iterative feedback loops that have made modern product delivery successful. The same feedback loops should exist between the changes in product environments and the mechanisms employed to keep them secure. Security measures should be iterative and agile enough to change their behavior as often as the software ecosystem in which they operate.
Security controls are typically designed with a particular state in mind (i.e., production release on Day 0). Meanwhile, the system ecosystem that surrounds these controls is changing rapidly every day: Microservices, machines, and other components are spinning up and spinning down; component changes are occurring multiple times a day through continuous delivery; external APIs are constantly changing on their own delivery schedules. Security tools and methods must be flexible enough to match the constant change and iteration in the environment. Without a security feedback loop, the system will eventually drift into security failure, just as a system without a development feedback loop would drift into unreliable operational readiness.
A new approach to instrumentation: Don't just test—experiment
The constantly changing stateless variables in modern distributed systems make it nearly impossible to understand how systems are working at any given moment. One way to approach this problem is through robust systematic instrumentation and monitoring. You can break security instrumentation into two primary buckets: testing and experimentation. Testing is the validation or assessment of a previously known outcome—or, in plain terms: We know what we are looking for before we go looking for it. Experimentation seeks to derive new insights and information that were previously unknown.
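The distinction between the two buckets can be sketched in a few lines. The firewall stub below is a hypothetical stand-in for a real control; the point is the shape of each activity: a test asserts a known outcome, while an experiment probes broadly to surface outcomes nobody thought to assert.

```python
import random


def firewall_blocks(port: int) -> bool:
    """Hypothetical control: only the web ports are open."""
    return port not in (80, 443)


# Testing: validate a previously known outcome.
# We know what we are looking for before we go looking for it.
def test_ssh_is_blocked():
    assert firewall_blocks(22)


# Experimentation: probe widely and record whatever comes back.
# Any unexpected open port is new information, not a pass/fail verdict.
def experiment_random_ports(trials: int = 100, seed: int = 1) -> set:
    rng = random.Random(seed)
    return {p for p in (rng.randrange(1, 65536) for _ in range(trials))
            if not firewall_blocks(p)}
```

The test can only ever confirm or refute the hypothesis it encodes; the experiment is how previously unknown gaps get discovered.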
Site reliability engineering and chaos engineering
"If you haven’t tried it, assume it's broken." —Unknown
"No plan survives first contact with the enemy." —Helmuth von Moltke
Two primary responsibilities of an SRE are to quantify confidence in the systems they maintain and to drive additional confidence in the systems' ability to perform to expectations. Confidence can be measured by past performance and predicted by understanding future reliability. Thorough testing helps predict future outcomes with enough detail to be practical and useful. The more completely a system is covered by tests and experiments, the less uncertainty and potential system unreliability. With enough testing, you can make more changes before system reliability falls below an acceptable level.
These same approaches to testing and instrumentation also apply to security experimentation. One of Google’s core principles of SRE is “practice, practice, practice.” It’s important to not only increase feedback loops within systems through consistent and rigorous testing but also to ensure that teams are operationally ready and battle-hardened when needed.
Injecting security failure events into systems helps ensure resilience to vulnerabilities we know are caused by human factors. Similar to hiring a security consultant to break into a high-rise building, we are constantly testing our ability to quickly identify and remediate hidden failures. We strive to mirror the security failure modes that are possible or that have historically occurred in our production security control plane and look for ways to simulate these within controlled circumstances. SREs, product teams, and security teams are expected to implement security controls and code their services to withstand potential failures and gracefully degrade when necessary without impacting the business. A common SRE adage is to do it once manually, and the second time, automate it. A primary function of an SRE is to work on automation to improve the system. By continuing to run security experiments, we can identify and remediate such vulnerabilities proactively in the ecosystem before they become crisis situations.
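A failure-injection experiment of this kind can be sketched as follows. The `Bucket` class and detector are hypothetical stand-ins (think of a cloud storage bucket accidentally made public); the experiment deliberately introduces the misconfiguration in a controlled setting, verifies that instrumentation catches it, and always rolls the injection back.

```python
class Bucket:
    """Hypothetical stand-in for a storage bucket and its access policy."""
    def __init__(self, name: str, public: bool = False):
        self.name, self.public = name, public


def detect_public_buckets(buckets):
    """Monitoring stand-in: report buckets that violate policy."""
    return [b.name for b in buckets if b.public]


def run_experiment(buckets):
    """Inject a security failure, then check that detection fired."""
    victim = buckets[0]
    victim.public = True                # inject: simulate a misconfiguration
    try:
        findings = detect_public_buckets(buckets)
        # If this assertion fails, we have found a detection gap
        # before an attacker did.
        assert victim.name in findings, "detection gap: alert never fired"
        return findings
    finally:
        victim.public = False           # always roll the injection back
```

Automating experiments like this—run them once manually, then on a schedule—is how the "practice, practice, practice" principle turns into a standing feedback loop rather than a one-off drill.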
Security + chaos engineering = security experimentation
Applying chaos engineering to cybersecurity has historically focused on how human factors and system glitches directly affect system security. The most common way we discover security failures is when a security incident is triggered. By this time, it is often too late, and damage has been done. We must take a more proactive approach. Combining the two disciplines has led to security experimentation, which we hope will serve as a foundation for developing a learning culture around how organizations build, operate, instrument, and secure their systems.
Acknowledging and anticipating failure in the way we design secure systems is already fundamentally challenging, if not unraveling, what we thought we knew about how our systems work. We will continue to share our findings as we explore this new domain.
If you are interested in learning more about our research or would like to get involved, please contact Aaron Rinehart or Patrick Bergstrom.
[See our related story, Security Chaos Engineering: A new paradigm for cybersecurity.]