Three years ago when I would talk to engineers and technology leaders about the ideas around Chaos Engineering, only about a fifth of the audience had heard of the concept. Now when I mention the term, most hands in the room go up.
This is due in large part to Netflix’s Chaos Monkey (and the rest of their “Simian Army”) as well as their Chaos Engineering team’s stories on the work they’ve done in the space and the benefits it’s produced.
Casey Rosenthal and Nora Jones, by running events like Chaos Community Days, have also helped raise awareness and build a community around the practice of injecting capital-c Chaos into our complex, socio-technical systems. The goal is to see not only how our technology reacts, but how our teams and the larger organization react to failure.
Discovering this information and working on ways to continuously improve our systemic reactions to the curveballs operating modern software throws at us is a noble and valuable endeavor. But there’s one thing that continues to give people pause about Chaos Engineering: the name. For a business navigating the chaos of a marketplace, it is not particularly appealing to most executives to “unleash Chaos” within their organization, no matter how well it’s planned for. And at a team and individual level, it can sit a bit uneasily as well: as my friend Boyd Hemphill recently asked, “Don’t we, as engineers, attempt to mitigate and remove Chaos?!”
That’s why I read with much interest Netflix’s recent decision to rename its Chaos Engineering team: it’s now the Resilience Engineering team. This is a notable development, and I think it also represents an important shift in thinking about the problem space.
Resilience Engineering addresses a number of different practical and operational components, which make it in many ways a superset of Chaos Engineering. The questions Resilience Engineering suggests we consider also create a fascinating space to contextualize this concept of “chaos” within:
- RE forces us to look at what capabilities are present within a system before a chaotic event occurs.
- It asks us to think about our capacity to adapt in the face of high-tempo, high-consequence, and uncertain situations, and to look at ways to fill out or increase that capacity.
- It challenges us to figure out how we can build infrastructure to support this adaptive capacity in a long-term, sustainable way, in all parts of our system.
- Lastly, it poses these questions not only of the code we write and the cloud we run it upon, but also of the people who run our systems, the teams those people come together to form, and the whole organization that is composed of those teams.
There’s no doubt that Chaos Engineering is an important skill for any learning organization to have in its toolbox. But focusing on it alone is not enough. Resilience Engineering confronts us with important considerations we need to address if we are to leverage the potential of chaos, controlled or otherwise.
In effect, Resilience Engineering sets chaos… in context.