Injecting chaos experiments into security log pipelines

Keeping today's complex systems secure requires a new approach: chaos engineering

Security teams depend on high-quality logs for most preventative security efforts. Preventing an incident requires observable insight into where a failure might come from, and logs are one important source of such insight. When an incident occurs, organizations must be able to respond to it and contain it as quickly as possible. Logs are essential not only for finding the source of a problem but also for identifying appropriate countermeasures.

But what happens when an organization doesn’t have the right log data? When an unknown or unforeseeable event occurs, how can we gain insights into why we didn’t see it coming?

Consider this scenario: You go to work as a security incident response engineer one fine Monday morning. As soon as you walk into your office, you are informed that the HR department has suddenly lost access to the content on its shared network drives, which includes some highly sensitive data. Further examination shows that all of the files and directories on the drive have been renamed with a .exe extension. At this point, you are almost certain that this is the work of some kind of malware and that you have a security incident on your hands.

As an experienced professional, you know just where to start investigating: You check your security monitoring solution to view the firewall logs, and you see that your monitoring solution hasn’t been collecting those logs for over a week. Turning to your firewall, you notice that it is configured to retain the logs from only the last 48 hours. You follow them as far back as you can, but it's clear that you need logs older than that to obtain any meaningful information. What happens next?

In the best-case scenario, you are ultimately able to weed out a possible source of the incident by collecting logs from all of your network devices to find other relevant events. It's a time-consuming process, but it might help solve the puzzle.

Worst case, you discover that your security monitoring solution hasn’t been receiving logs from a huge number of network devices for a long time, and troubleshooting means you must wait until your access request is approved, then manually log in to each relevant device and manually track down all the relevant information you can find. There is no easy way to pivot around any of this information and no trend information to observe. In such situations, it’s wise to remember that we cannot change the past, and our past usually becomes a prologue.
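Both cases above turn on a source going silent without anyone noticing. As a minimal sketch of how that condition could be surfaced, the following compares each source's newest event timestamp against a tolerated silence and estimates how much of the gap has already aged out of the device's own retention. The source names, the one-hour tolerance, and the `last_seen` inventory are hypothetical; only the 48-hour retention window comes from the scenario.

```python
from datetime import datetime, timedelta, timezone


def find_silent_sources(last_seen, now, max_silence):
    """Return (source, silence) pairs for sources quiet longer than max_silence.

    last_seen maps a source name to the timestamp of its newest received event.
    """
    silent = []
    for source, seen_at in last_seen.items():
        silence = now - seen_at
        if silence > max_silence:
            silent.append((source, silence))
    # Longest silence first, so the worst gaps are triaged first.
    return sorted(silent, key=lambda pair: pair[1], reverse=True)


def unrecoverable_gap(silence, device_retention):
    """Portion of the silence that the device itself no longer retains."""
    return max(silence - device_retention, timedelta(0))


# Hypothetical inventory; in practice this would come from a SIEM query.
now = datetime(2018, 11, 5, 9, 0, tzinfo=timezone.utc)
last_seen = {
    "firewall-01": now - timedelta(days=8),
    "vpn-gateway": now - timedelta(minutes=3),
    "hr-fileserver": now - timedelta(hours=30),
}

for source, silence in find_silent_sources(last_seen, now, timedelta(hours=1)):
    lost = unrecoverable_gap(silence, device_retention=timedelta(hours=48))
    print(f"{source}: silent for {silence}, unrecoverable: {lost}")
```

Any source whose unrecoverable portion is greater than zero has already lost data for good, which is the situation the firewall in the scenario is in.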

How quickly would you and your team be able to respond to the scenario described above? If your log flow suddenly stopped, how long would it take for your team to realize it? Would they be able to identify the failure quickly and fix it in a matter of minutes or hours? How would you detect it, and what makes you objectively believe that? If and when you detect the outage, would you know how to bring the system back up? Who would be responsible for doing that? How do you build a robust log pipeline in the first place?

In this article, I will attempt to answer these questions.

Challenges of automation in DFIR (Digital Forensics & Incident Response)

Almost every security organization today is embracing automation, although the extent varies depending on the industry and scale of the business. Most tech companies run on a combination of cloud-hosted infrastructure and physical data center space. Services such as Google G Suite or Microsoft Office 365, third-party HR applications, internal or cloud-hosted versions of GitHub or Bitbucket, firewalls, routers, switches, and anti-virus tools are some examples of the kinds of infrastructure typically used. These components are often required to run a small or medium-sized business in today’s digital economy; larger businesses require larger technology stacks.

For a security incident response team, that’s at least a dozen different systems to collect logs from, and each of these systems handles log processing and transmission differently. In rare cases, collecting data is as simple as specifying the central logging server or bucket the logs should go to. But more often it requires introducing a series of agents between the log source and the log destination: writing scripts to fetch the logs from the source, finding a way to schedule them, creating an agent to transfer the logs to the destination, and so on. As Lisanne Bainbridge suggested in her 1983 research paper, "Ironies of Automation," one of the classic ironies of automation is that trying to automate can end up making the system a lot more complex than its manual counterpart. Also, the degree to which a system is automated depends entirely on the automator’s creativity and imagination.

Gaining new insights with security chaos engineering

There are many different ways to approach this challenge. A solid first step is to monitor and alert on log trends. In the era of expensive SIEM (security information and event management) tools, it is relatively easy to set up alerts for unusual spikes and dips in the log flow. This can be done for every pipeline from which a SIEM receives logs. It is also considered a best practice to maintain a centralized log data lake that is constantly monitored for access and activity.
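To make the idea of spike-and-dip alerting concrete, here is a minimal sketch that compares the current interval's event count for one pipeline against its recent history. It is not how any particular SIEM implements its alerts; the function name `classify_volume`, the three-standard-deviation threshold, and the sample counts are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def classify_volume(history, current, threshold=3.0):
    """Classify the current interval's event count against recent history.

    Returns "dip", "spike", or "normal" based on how many standard
    deviations the current count sits from the historical mean.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical counts")
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Perfectly flat history: any change at all is unusual.
        if current == baseline:
            return "normal"
        return "spike" if current > baseline else "dip"
    z = (current - baseline) / spread
    if z <= -threshold:
        return "dip"
    if z >= threshold:
        return "spike"
    return "normal"


# Hypothetical events-per-hour counts for one pipeline.
firewall_history = [980, 1010, 1005, 995, 1020, 990, 1000]
for count in (1003, 0, 5000):
    print(count, classify_volume(firewall_history, count))
```

A fixed threshold like this ignores daily and weekly seasonality, so a real deployment would likely compare against the same hour on previous days rather than the immediately preceding intervals.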

To be proactive about log pipeline health, it’s important to schedule tests of the alert triggers at least annually. However, in our experience, pipelines are tested only when they are implemented; after that point, teams depend solely on the quality of the SIEM alert to tell them whether a pipeline is broken.

Keeping this use case in mind, I propose an alternative approach: chaos engineering. Rapidly gaining popularity in the site reliability engineering (SRE) world, chaos engineering is an empirical, systems-based approach that addresses the chaos in distributed systems at scale and builds confidence in their ability to withstand realistic conditions. The team learns about the behavior of a distributed system by observing it during a controlled experiment. In layman's terms, chaos engineering is the practice of breaking your own systems on purpose in order to observe the results and derive new insights from them, including the knock-on effects on other systems.

Some readers might consider this practice similar to traditional red-teaming or penetration testing, but in fact, it differs from both in purpose and methodology. The ultimate goal of red-teaming is to determine the effectiveness of preventative security controls by gaining access to sensitive resources through deceptive adversarial methods, without disrupting live systems. The ultimate goal of security chaos engineering, by contrast, is to learn through careful and methodical experimentation on systems. Our approach focuses on injecting failure into specific components to reveal unknown and unforeseeable problems in the system before they impact the operational integrity of core business products and services.
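Applied to a log pipeline, such an experiment might state a hypothesis ("if a source stops, the silence alert fires within a set deadline"), inject the failure, observe, and then roll back. The sketch below runs that loop against a toy in-memory simulation rather than real infrastructure. `SimulatedPipeline`, `run_experiment`, and the tick-based timing are hypothetical names and simplifications; a real experiment would stop an actual agent and query the actual alerting system.

```python
from dataclasses import dataclass, field


@dataclass
class SimulatedPipeline:
    """Toy stand-in for a log pipeline plus its silence alert."""
    alert_after_ticks: int           # silence the alert rule tolerates
    source_running: bool = True
    ticks_since_event: int = 0
    alerts: list = field(default_factory=list)

    def tick(self, t):
        # A running source delivers one event per tick, resetting the timer.
        if self.source_running:
            self.ticks_since_event = 0
        else:
            self.ticks_since_event += 1
        if self.ticks_since_event == self.alert_after_ticks:
            self.alerts.append(t)


def run_experiment(pipeline, inject_at, deadline_ticks, total_ticks):
    """Stop the source at inject_at and report whether an alert fired in time."""
    detected_at = None
    try:
        for t in range(total_ticks):
            if t == inject_at:
                pipeline.source_running = False   # the injected failure
            pipeline.tick(t)
            if detected_at is None and pipeline.alerts:
                detected_at = pipeline.alerts[0]
    finally:
        pipeline.source_running = True            # always roll back
    # Count the silent ticks that elapsed before the alert, if there was one.
    delay = None if detected_at is None else detected_at - inject_at + 1
    return {
        "detected": detected_at is not None,
        "ticks_to_detect": delay,
        "hypothesis_held": delay is not None and delay <= deadline_ticks,
    }


well_tuned = run_experiment(SimulatedPipeline(alert_after_ticks=5), 10, 15, 60)
too_lax = run_experiment(SimulatedPipeline(alert_after_ticks=120), 10, 15, 60)
print(well_tuned)
print(too_lax)
```

The second run is the interesting one: the alert rule exists, but its tolerance is so generous that the outage goes undetected for the whole experiment, which is the kind of gap that testing a pipeline only at implementation time would not reveal.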

As Homo sapiens, we must face the fact that systems are evolving faster than our cognitive reasoning can interpret. Despite whatever prior knowledge we possess, we choose to see only what we want to see and hear only what we want to hear. Our understanding of a system reflects what we believe about it. The purpose of chaos engineering is not to break things, but to learn new information about how our complex adaptive systems really work versus what we thought we knew.

Returning to the event logging example described above, we, as an engineering team, were operating under the assumption that the system was working correctly. These kinds of mistakes cost companies millions of dollars every hour. Consider the July 2018 Amazon Prime Day outage where Amazon incurred costs up to $33 million per hour while engineers scrambled to diagnose and triage the problem. That three-hour outage could potentially have been identified proactively using resilience techniques such as chaos engineering.

When searching for the unexpected, the logical approach is to objectively make it expected. Our systems are evolving so rapidly and becoming so complex that we must seek new methods such as chaos engineering to better understand how they work, improve them, and anticipate the nonlinear nature of their unpredictable outcomes.