About a year ago, brick-and-mortar businesses like restaurants and grocery stores were scrambling to set up delivery and curbside pickup. A lot of them used chaos engineering, in production, to hunt for failures quickly before launching new features and services. It was similar for education platforms that went from nice-to-have to absolutely essential in just a week’s time.
The urgency of the pandemic overcame a lot of reluctance towards adopting a chaos mindset, according to Tammy Bryant Bütow, principal Site Reliability Engineer at Gremlin, a chaos engineering platform for enterprise. Plus, she pointed out in an interview, because people are at home more now, they are faster to tweet if there’s an outage. So it’s a mix of a sense of duty to serve at previously unforeseen scale and an act of saving face.
Teams are finally understanding that chaos engineering is essential in order to plan for the unforeseen and to meet spikes in traffic and users (like those organisations experienced in 2020). But what is chaos engineering, and how do you persuade your teams to embrace it?
WTF is chaos engineering?
Chaos engineering was named by Netflix to evoke the idea of mischievous monkeys throwing things at your systems. Because nothing could be more unpredictable than a barrel of monkeys let loose — except, perhaps, distributed systems.
For over a decade now, open source Chaos Monkey has been randomly terminating instances in production to test if your systems are actually resilient when that proverbial shit hits that proverbial fan.
Since then, a whole class of chaos engineering tools has cropped up. And we’ve seen the emergence of an operations role, the Site Reliability Engineer (SRE), dedicated to finding faults within our systems and automating their fixes. SRE is a 50/50 mix of being on-call when things go wrong and experimenting to find hidden vulnerabilities.
Chaos engineering is a unique mix of science and intelligent creativity aimed at increasing the reliability of your systems at scale.
As Bryant Bütow put it, “Chaos engineering is thoughtful, planned experiments designed to reveal weaknesses in our systems.”
Chaos engineering follows the basic scientific method but, since no system, stack, or environment is the same, different experiments are performed at every organisation. And it’s anything but chaotic.
Sylvain Hellegouarch, CTO and co-founder of Reliably, which allows for chaos automation at the infrastructure, platform, and application levels, describes it as “asking questions about your system’s behavior under certain conditions and enabling you to safely try it out live so that you can, collectively with your team, see if there is a real weakness and learn what the right response should be.”
Chaos begins with asking questions like: What if Server X shuts down? What if our traffic suddenly grew by 300%? What if the connection between our application and database dropped? What if Integration Y failed? And, perhaps most importantly, would monitoring and alerts work properly if these things happened?
Hypotheses are formed around what should happen. If results differ, you need an easy way to roll things back, and to figure out the cause when things go wrong so you can shore things up. Then you need to automate the chaos testing to make sure the same problem won’t happen again unexpectedly.
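As a rough illustration of that loop, here is a minimal Python sketch of one experiment: check the steady state, inject a fault, test the hypothesis, and always roll back. The `run_experiment` helper and the two-replica toy system are hypothetical names made up for this example, not part of any chaos engineering tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    hypothesis_held: bool
    rolled_back: bool


def run_experiment(
    steady_state: Callable[[], bool],
    inject_fault: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    """Run one experiment: verify, inject, observe, and always roll back."""
    # Refuse to start if the system is already unhealthy.
    if not steady_state():
        raise RuntimeError("system is not in steady state; aborting")
    rolled_back = False
    try:
        inject_fault()
        # Hypothesis: the steady state survives the injected fault.
        held = steady_state()
    finally:
        # Roll back whether the hypothesis held or not.
        rollback()
        rolled_back = True
    return ExperimentResult(hypothesis_held=held, rolled_back=rolled_back)


# Toy system (an assumption for illustration): two replicas of one service.
replicas = {"server-x": True, "server-y": True}


def service_is_up() -> bool:
    return any(replicas.values())


def kill_server_x() -> None:
    replicas["server-x"] = False


def restore_server_x() -> None:
    replicas["server-x"] = True


result = run_experiment(service_is_up, kill_server_x, restore_server_x)
print(result)
```

Keeping the rollback in a `finally` block is one way to make sure the fault is removed even if the check itself fails. Once an experiment like this is scripted, it can be rerun automatically to catch regressions.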
The culture of chaos
What chaos engineering comes down to is considerate communication and culture.
“Chaos engineering is about building a culture of resilience in the presence of unexpected system outcomes,” wrote Nora Jones, co-author of the O’Reilly Media book on chaos engineering.
Other than with SREs, the seeming oxymoron of “chaos engineering” is still a tough sell at many companies. And it’s not just the more traditional and usually change-resistant organisations — like financial or health services — that are pushing back. You need to convincingly convey the value of chaos experiments to your engineers early on, as this experimentation almost always involves taking time away from their regular work.
To start, de-emphasise that you are breaking things on purpose. Explain that chaos engineering isn’t chaotic at all—it’s really systematic. It’s more about increasing resiliency to prepare for inevitable, unavoidable incidents.
You also have to make it clear that it’s not about catching teams off guard or pointing fingers. If you are going to perform chaos on a certain part of a system, what better person to help you design the experiment than the person who built that piece and who knows exactly what it’s supposed to do? Your first experiments are most likely to succeed when they target the uncertainties that are keeping teammates up at night, so ask them what they’re most worried about.
It’s also unnecessary to start chaos engineering in production, something most teams aren’t ready for right away. While your team is getting used to chaos, run experiments in a pre-production environment like staging or development. Once you have the processes and responses around your chaos experiments in place, move on to production, the only truly realistic testing environment.
Remember, timing is everything. Make it clear when you are going to perform the experiment, warning anyone who could be affected. And emphasise that your test, or “fire drill,” is meant to uncover weaknesses in a controlled environment during the work day, so you don’t have to wake colleagues up when a customer is complaining about it at two in the morning.
Highlight not only the experiments you’ll be doing, but also the ways you are planning to mitigate any unexpected harm.
Finally, when the chaos experiment is over, make a clear plan with everyone affected for how you will remedy the situation, and share what happened: what caused it and how it was fixed. It’s important to construct a timeline and written documentation so future engineers can understand what happened, and so you can automate these tests.
Chaos in action
This year, Gremlin released the first State of Chaos Engineering report, in which the company surveyed all of its customers and found the most common applications of chaos engineering. Among the findings:
- Organisations that run chaos engineering experiments frequently have greater than 99.9% availability of their systems.
- 23% of organisations have a mean time to resolution of under one hour.
- 60% of respondents said they had run at least one chaos engineering attack.
- 34% of organisations said they run chaos experiments in production.
The most common use of chaos engineering, according to Bryant Bütow, was validating monitoring and alerting, because without them you can’t know anything is wrong, let alone fix it.
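To make alert validation concrete, the sketch below models a simple error-rate alert and checks that it fires once failures are injected. The `ErrorRateAlert` class, its window size, and its threshold are assumptions invented for this example; a real experiment would inject faults into a live service and watch the actual alerting pipeline.

```python
from collections import deque


class ErrorRateAlert:
    """Fires when the failure ratio over the last `window` requests exceeds `threshold`."""

    def __init__(self, window: int, threshold: float) -> None:
        self.samples = deque(maxlen=window)
        self.threshold = threshold

    def record(self, ok: bool) -> None:
        self.samples.append(ok)

    def firing(self) -> bool:
        # Stay quiet until a full window of data is available.
        if len(self.samples) < self.samples.maxlen:
            return False
        failures = sum(1 for ok in self.samples if not ok)
        return failures / len(self.samples) > self.threshold


alert = ErrorRateAlert(window=10, threshold=0.5)

# Steady state: ten healthy requests, so the alert should be silent.
for _ in range(10):
    alert.record(True)
fired_before = alert.firing()

# Injected fault: six consecutive failures push the window over the threshold.
for _ in range(6):
    alert.record(False)
fired_after = alert.firing()

print(fired_before, fired_after)
```

If the alert stays silent after the injected failures, the experiment has found a monitoring gap rather than an application bug.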
The next most common use was what Gremlin refers to as resource starvation: what happens if you spike the CPU or run out of memory? How does your system react when it only has 5% CPU?
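The effect of starving a service can be reasoned about with a very small model before touching real infrastructure. The following sketch is a deterministic toy simulation, not a measurement: the `simulate` function, the capacity figures, and the queue limit are all assumptions chosen for illustration. It compares a service at full capacity with the same service limited to 5% of it.

```python
from typing import Dict


def simulate(capacity_per_tick: int, arrivals_per_tick: int,
             ticks: int, queue_limit: int) -> Dict[str, int]:
    """Model a service with a bounded queue that processes requests each tick."""
    queue = served = dropped = 0
    for _ in range(ticks):
        queue += arrivals_per_tick
        # Requests beyond the queue limit are shed.
        if queue > queue_limit:
            dropped += queue - queue_limit
            queue = queue_limit
        # Process as many queued requests as capacity allows.
        done = min(queue, capacity_per_tick)
        served += done
        queue -= done
    return {"served": served, "dropped": dropped, "backlog": queue}


# Full capacity: 100 requests per tick against 60 arrivals per tick.
healthy = simulate(capacity_per_tick=100, arrivals_per_tick=60,
                   ticks=10, queue_limit=200)

# Starved: the same service with only 5% of its capacity.
starved = simulate(capacity_per_tick=5, arrivals_per_tick=60,
                   ticks=10, queue_limit=200)

print("healthy:", healthy)
print("starved:", starved)
```

In this model the starved service quickly fills its queue and starts shedding most requests, which is the kind of behaviour a real resource experiment is designed to surface and measure.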
Another popular use of chaos engineering is dependency analysis. Bryant Bütow described this as, “I run Service A, and, for it to work, I need Service B, C, and D to work.”
When you’re an engineering manager of a service that isn’t critical, but relies on other services to stay up, she said, you don’t have much leverage to demand uptime from other services. But with chaos engineering, you’re able to prove that you need to make sure these other integrated systems work, too.
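One way to picture dependency analysis is as a graph walk: given a map of which service needs which, find everything that is affected when one piece fails. The sketch below assumes a hypothetical dependency map (`DEPENDS_ON`) and a hypothetical helper (`impacted_by`); real dependency data would come from your own architecture and from the results of experiments.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical map: each service lists the services it needs in order to work.
DEPENDS_ON: Dict[str, List[str]] = {
    "service-a": ["service-b", "service-c", "service-d"],
    "service-b": ["database"],
    "service-c": [],
    "service-d": ["service-c"],
    "database": [],
}


def impacted_by(failed: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the edges so each service maps to the services that need it.
    dependents: Dict[str, List[str]] = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first walk upwards from the failed service.
    impacted: Set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, []):
            if parent not in impacted:
                impacted.add(parent)
                queue.append(parent)
    return impacted


print(sorted(impacted_by("database", DEPENDS_ON)))
print(sorted(impacted_by("service-c", DEPENDS_ON)))
```

A chaos experiment then tests whether the map is right: take down the dependency and see whether the services the graph predicts are really the ones that suffer.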
One popular use case is to zero in on third-party dependencies, particularly when teams are using a mix of managed services and running their own clusters. For example, take a database and blackhole a whole region, making designated addresses unreachable from your application. A lot of teams are using this feature before they launch, particularly to make sure their multi-cloud environment is reliable. Cloud providers are increasingly making rapid changes that organisations have to adhere to and prepare for. Chaos engineering is one way to do that.
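The blackhole idea can be sketched in a few lines: make designated addresses unreachable and check that the application fails over. The example below is a simulation only; the `FakeNetwork` class, the `query_with_failover` helper, and the addresses are hypothetical and do not reflect Gremlin’s or any cloud provider’s API.

```python
from typing import Iterable, Set


class FakeNetwork:
    """Stand-in for a network layer in which addresses can be blackholed."""

    def __init__(self) -> None:
        self.blocked: Set[str] = set()

    def blackhole(self, addresses: Iterable[str]) -> None:
        self.blocked.update(addresses)

    def clear(self) -> None:
        self.blocked.clear()

    def connect(self, address: str) -> str:
        if address in self.blocked:
            raise ConnectionError(f"{address} is unreachable")
        return f"connected to {address}"


def query_with_failover(network: FakeNetwork, addresses: Iterable[str]) -> str:
    """Try each database address in order and return the first that answers."""
    for address in addresses:
        try:
            return network.connect(address)
        except ConnectionError:
            continue
    raise ConnectionError("all database addresses are unreachable")


# Hypothetical database endpoints in two regions.
DATABASES = ["db.region-1.example.internal", "db.region-2.example.internal"]

network = FakeNetwork()
before = query_with_failover(network, DATABASES)

# Experiment: blackhole the whole first region, then query again.
network.blackhole(["db.region-1.example.internal"])
during = query_with_failover(network, DATABASES)

# Roll back so the first region is reachable again.
network.clear()
after = query_with_failover(network, DATABASES)

print(before, during, after, sep="\n")
```

If the second query had raised instead of reaching the other region, the experiment would have shown that failover does not work as assumed, well before launch.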
Often, users find that their third-party dependencies aren’t reliable, but that the cause is complicated configuration that can be fixed. For example, Azure Kubernetes Service recently replaced Docker with containerd, but Microsoft only gave a couple of months’ notice, including over the end-of-year holidays, and many teams are just realising it now.
Distributed systems come with flexibility, speed and security, but they also come with a lot of uncertainty, especially when much of your stack is managed by third-party providers that you have no control over. Chaos engineering is one of a series of reliability and resiliency practices necessary to learn how your systems perform under pressure, and how to continuously take steps to improve that performance.