IT Failures Are Inevitable
This is the first post in a series on resilience, self-healing systems, and ongoing testing. This post focuses on how xMatters introduces failures to test whether our systems can handle the chaos without service disruption.
Systems fail for many reasons. When it’s not bugs, it’s misconfiguration or interactions between services that cause unexpected behavior. Sometimes, the network is down, slower than usual, or flooded with requests. Finally, there are those rare natural events that can render data centers inoperative.
In the face of these threats, companies are turning to the microservice architecture for increased service availability and resilience to failure.
Those are great reasons for wanting to break up monolithic applications, but unless the new services are designed with resiliency in mind, they are likely to compound the risk of failure. After all, more moving parts mean more things can go wrong. And if more things go wrong, you’ll spend more time recovering those services. That doesn’t sound very appealing.
“Distributed systems are inherently chaotic”
The Principles of Chaos Engineering tell us that “distributed systems are inherently chaotic” and that services should be developed with two things in mind: failure, and the ability to automatically recover.
At xMatters, we take a proactive approach to handling failure. In fact, we instigate failures on a regular basis in our infrastructure to make sure that our systems can handle the chaos without disrupting services to our clients. In this post, I will give a high-level overview of our approach.
According to Chaos Engineering, we should be able to break services in production and have the system recover without impacting clients. Of course, it is a good idea to test resilience before changes make it to production, so it is important to be able to simulate production-like traffic in test environments. This helps us be sure that the changes we bring to our services support the average load we see in production, as well as the peaks of requests we get on a regular basis. In fact, we know that our product is able to handle twice that amount without having to scale up.
When it comes to resilience testing, the best tools do more than send requests into individual services and check that they are accepted without errors. Under certain conditions, the input service can fail to hand off a request to the rest of the system without reporting the failure. The reasons aren’t always clear.
Maybe the request went into a queue that doesn’t have a consumer. Maybe the issue was reported, but at the wrong error level, so it went under the radar of the monitoring tools. Or maybe the service did output errors, but thanks to some great retry logic, the request still ended up reaching the correct destination.
The point is that, for the purpose of Chaos Testing, load tests should not be concerned with how requests are handled, only with whether they have the correct impact. Also, there is little point in putting load on services individually when assessing how all services play together. It is an end-to-end kind of validation.
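To make the idea concrete, here is a minimal Python sketch of an impact-based check. It is an illustration under stated assumptions, not our actual load-testing tool: `verify_end_to_end`, `FakeSystem`, and the "drop every fifth request" behavior are all hypothetical. The check tags each request with an ID, ignores whether the intake reported success, and only compares what was sent with what arrived at the far end.

```python
import uuid
from typing import Callable, Iterable, Set


def verify_end_to_end(
    send: Callable[[str], bool],
    delivered_ids: Callable[[], Iterable[str]],
    request_count: int,
) -> Set[str]:
    """Send tagged requests and return the IDs that never reached the far end.

    The return value of `send` (accepted or not) is deliberately ignored:
    only the observable end result counts.
    """
    sent = {str(uuid.uuid4()) for _ in range(request_count)}
    for request_id in sent:
        send(request_id)
    return sent - set(delivered_ids())


class FakeSystem:
    """Toy stand-in for a real system (hypothetical).

    The intake accepts everything but silently loses every fifth request.
    """

    def __init__(self) -> None:
        self.outbox = []
        self._seen = 0

    def send(self, request_id: str) -> bool:
        self._seen += 1
        if self._seen % 5 != 0:
            self.outbox.append(request_id)
        return True  # always reports success, even when the request is lost

    def delivered(self):
        return list(self.outbox)


system = FakeSystem()
missing = verify_end_to_end(system.send, system.delivered, request_count=20)
print(f"{len(missing)} of 20 requests were accepted but never delivered")
```

A test that only looked at the responses from `send` would report a clean run here, which is exactly the blind spot described above.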
Once we establish a baseline for how our services behave under load, we start introducing failure events and observe what happens. As with all software testing, it is best to have automated tools that reproduce scenarios quickly and effortlessly. In addition to making it possible to verify fixes and changes to the services, this allows running random failure scenarios on a schedule in any environment.
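One way to reconcile "random" with "reproducible" is to drive the random selection from a recorded seed. The sketch below assumes a hypothetical scenario catalog and a hypothetical `pick_scenarios` helper; a scheduler (cron, a CI job, or similar) would call it with a fresh seed on each run and log that seed so the same run can be replayed later.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Scenario:
    name: str
    target: str


# Hypothetical catalog; real scenarios would describe actual infrastructure.
CATALOG = [
    Scenario("kill-process", "api"),
    Scenario("drop-network", "queue"),
    Scenario("fill-disk", "database"),
    Scenario("add-latency", "api"),
]


def pick_scenarios(
    catalog: Sequence[Scenario], count: int, seed: Optional[int] = None
) -> List[Scenario]:
    """Pick `count` distinct scenarios; the same seed yields the same picks."""
    rng = random.Random(seed)  # private generator, does not touch global state
    return rng.sample(list(catalog), k=min(count, len(catalog)))


first_run = pick_scenarios(CATALOG, count=2, seed=42)
replay = pick_scenarios(CATALOG, count=2, seed=42)
print([scenario.name for scenario in first_run])
print(first_run == replay)  # True: the recorded seed reproduces the run
```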
According to Netflix, “The best defense against major unexpected failures is to fail often.” This was the motivation behind the construction of Chaos Monkey, one of Netflix’s first chaos testing tools, which is now available on GitHub. In addition to Chaos Monkey, the last few years have seen the birth of a few different tools to help with resilience testing. After reviewing a series of them, we opted to create our own tool, which we named Cthulhu (a reference to H.P. Lovecraft’s cosmic entity, known for driving insane anything it interacts with). We made our own tool mainly so we could coordinate complex failure scenarios that impact different infrastructure technologies in a data-driven manner.
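To give a feel for what "data-driven" can mean here, the following Python sketch describes a failure scenario as plain data and dispatches each step to a handler for the relevant technology. This is not Cthulhu's actual format or code: the JSON schema, the technology names, the targets, and the recording handlers are all invented for illustration, and the handlers only log what a real one would do.

```python
import json
from typing import Callable, Dict, List

# Hypothetical scenario description; every field name here is an assumption.
SCENARIO_JSON = """
{
  "name": "queue-outage-during-node-loss",
  "steps": [
    {"technology": "vm", "action": "stop", "target": "worker-1"},
    {"technology": "queue", "action": "pause", "target": "notifications"},
    {"technology": "container", "action": "kill", "target": "api-gateway"}
  ]
}
"""

Handler = Callable[[str, str], None]


def make_recorder(log: List[str], technology: str) -> Handler:
    """Build a stub handler that records the step instead of breaking anything."""

    def handler(action: str, target: str) -> None:
        log.append(f"{technology}: {action} {target}")

    return handler


def run_scenario(scenario: dict, handlers: Dict[str, Handler]) -> None:
    """Execute the steps in order, one handler per infrastructure technology."""
    for step in scenario["steps"]:
        technology = step["technology"]
        if technology not in handlers:
            raise ValueError(f"no handler registered for {technology!r}")
        handlers[technology](step["action"], step["target"])


log: List[str] = []
handlers = {name: make_recorder(log, name) for name in ("vm", "queue", "container")}
run_scenario(json.loads(SCENARIO_JSON), handlers)
print("\n".join(log))
```

Keeping the scenario in data rather than code means a new failure combination is a new file, not a new program.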
There are advantages to Chaos Testing that go beyond asserting that systems are recovering automatically, particularly while a system is in transition toward a fully automated recovery strategy. Failure scenarios expose ways that the system can fail. Before self-healing logic is in place, the engineering team has to recover the services manually. This allows confirmation that the engineering team is able to recover the system within SLAs. It also serves as a drill for engineers to remain up to date as to how to keep their ever-changing system running.
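Checking recovery against an SLA comes down to measuring the time between the injected failure and the first healthy probe. Here is a small sketch of that measurement; the function name, the 120-second SLA, and the fake clock are assumptions made for the example. The clock and sleep functions are passed in so the logic can be exercised without actually waiting.

```python
from typing import Callable, Optional


def measure_recovery(
    is_healthy: Callable[[], bool],
    now: Callable[[], float],
    sleep: Callable[[float], None],
    poll_interval_s: float = 1.0,
    timeout_s: float = 600.0,
) -> Optional[float]:
    """Poll until the system is healthy; return elapsed seconds, or None on timeout."""
    started = now()
    while now() - started <= timeout_s:
        if is_healthy():
            return now() - started
        sleep(poll_interval_s)
    return None


class FakeClock:
    """Simulated clock so the example runs instantly."""

    def __init__(self) -> None:
        self.t = 0.0

    def now(self) -> float:
        return self.t

    def sleep(self, seconds: float) -> None:
        self.t += seconds


clock = FakeClock()
# Pretend the service comes back 90 simulated seconds after the failure.
elapsed = measure_recovery(
    lambda: clock.now() >= 90, clock.now, clock.sleep, poll_interval_s=5.0
)
SLA_S = 120.0  # hypothetical recovery target
within_sla = elapsed is not None and elapsed <= SLA_S
print(f"recovered in {elapsed}s, within SLA: {within_sla}")
```

In a real drill, `now` and `sleep` would be `time.monotonic` and `time.sleep`, and `is_healthy` would be an end-to-end probe rather than a lambda.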
The engineering team will eventually come up with a way to automate the recovery. In the meantime, they must know how to get all services back on track. The best way to achieve this is to have random failure scenarios run on a regular basis and to not disclose the details of a run to the engineering team. That way, we can include the discovery and diagnosis of the issue in our drill (this is, after all, part of the SLAs).
Since its inception, Cthulhu has allowed us to improve the design of many of our services. Among other things, we found and addressed cases where services didn’t get updates when downstream services changed. We also found areas where monitoring was missing.
Having resolved these cases, our system can now detect services that become unresponsive and restart them. This tool has really become an integral part of our development cycle. As we keep developing our Chaos Engineering practice, we occasionally discover new failure cases and quickly proceed to resolve them. It is always exciting to find new ways to make our product more robust.
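The "detect and restart" behavior can be pictured as a simple watchdog loop. The sketch below is a generic illustration, not our production mechanism: the `Watchdog` class, the threshold of three consecutive failed probes, and the `FlakyService` stand-in are all hypothetical.

```python
from typing import Callable


class Watchdog:
    """Restart a service after several consecutive failed health checks."""

    def __init__(
        self,
        probe: Callable[[], bool],
        restart: Callable[[], None],
        failure_threshold: int = 3,
    ) -> None:
        self._probe = probe
        self._restart = restart
        self._failure_threshold = failure_threshold
        self._consecutive_failures = 0

    def tick(self) -> bool:
        """Run one health check; return True if a restart was triggered."""
        if self._probe():
            self._consecutive_failures = 0
            return False
        self._consecutive_failures += 1
        if self._consecutive_failures >= self._failure_threshold:
            self._restart()
            self._consecutive_failures = 0
            return True
        return False


class FlakyService:
    """Toy service that stays unresponsive until it is restarted."""

    def __init__(self) -> None:
        self.responsive = True
        self.restarts = 0

    def is_responsive(self) -> bool:
        return self.responsive

    def restart(self) -> None:
        self.restarts += 1
        self.responsive = True


service = FlakyService()
watchdog = Watchdog(service.is_responsive, service.restart, failure_threshold=3)

service.responsive = False  # simulate the service hanging
results = [watchdog.tick() for _ in range(4)]
print(results, service.restarts)
```

Requiring several consecutive failures before restarting is a design choice that trades a slightly slower reaction for fewer restarts caused by a single slow response.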
When incidents occur (and they will), xMatters uses integrations to get notifications and information into the systems and to the people that need it for fast resolution. Try xMatters free and see why with xMatters, an incident notification is only the beginning of the resolution process.