Chaos and engineering are two words you would never have seen together before 2010, probably because engineering is generally considered constructive while chaos is, well, quite the opposite. So when you first read about Netflix’s Chaos Monkey, the original chaos engineering tool, it’s tempting to imagine a monkey pulling out wires and hammering on systems with a wrench at random. On the contrary, chaos engineering is a disciplined approach to finding faults before they become outages. It relies on proactive failure testing: injecting failures in a controlled manner so that the system’s ability to respond can be measured and improved. These failures could be anything from added latency to network failures to high CPU and memory usage. You can even simulate a DDoS (distributed denial-of-service) attack or any other event that would disrupt your services. Quoting Seth Eliot from AWS, “You can’t consider your workload to be resilient until you hypothesize how your workload will react to failures.” And that brings us to Chaos Mesh.
Built for Chaos
Chaos Mesh 1.0 hit general availability earlier this month. It is an open-source, CNCF-hosted, cloud-native chaos engineering solution that orchestrates fault injection for complex systems on Kubernetes. Chaos Mesh was developed at PingCAP to fault-test its open-source database system, TiDB, and it is the second CNCF-hosted project from PingCAP after TiKV, the distributed key-value storage engine. PingCAP CTO Ed Huang stated that “traditional deterministic testing” isn’t enough to ensure resilience in distributed systems on Kubernetes.
Built exclusively for Kubernetes, Chaos Mesh covers chaos experiments on everything from pods to networks to file systems and even the kernel itself. Chaos Mesh aims to be a “universal” and neutral chaos testing platform that’s easy to use, as well as scalable to the petabyte level. Being Kubernetes-native, it requires no special dependencies or modifications and can be deployed directly on clusters. Another advantage is that it integrates well with other testing frameworks and can even run chaos experiments in production environments.
Unlike in the past, when servers being “up” was an indication of everything being as it’s supposed to be, today it’s about functionality and user experience. Similar to how Netflix checks its pulse by counting the number of times the play button is pressed, a Chaos Mesh experiment can simulate downtime of a TiKV pod and then analyze how this affects the QPS (queries per second). If the button in question doesn’t do what it’s supposed to, users are going to keep pressing it, causing the QPS to go up and indicating that something isn’t functioning right.
If a crashed pod causes the QPS to fluctuate and then come back to acceptable levels within a minute, you know your system is resilient, and you can move on to the next test. If it takes as long as nine minutes to return to normal, however, further investigation is required. Chaos Mesh features two kinds of pod-level fault injection, namely pod-kill and pod-failure. The former, as the name suggests, simulates a pod being killed, while the latter simulates a pod being unavailable. You can also create custom pod-level faults like PodNetworkChaos or PodIOChaos.
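To make the one-minute versus nine-minute comparison concrete, here is a minimal Python sketch of how recovery time could be measured from a series of QPS samples. Everything in it is an assumption for illustration: the function names, the 10% tolerance band around the baseline, and the 60-second budget are ours, not part of Chaos Mesh, and the samples are assumed to be sorted by time.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (seconds since start of test, observed QPS)


def recovery_seconds(
    samples: Sequence[Sample],
    fault_time: float,
    baseline_qps: float,
    tolerance: float = 0.10,
) -> Optional[float]:
    """Seconds from fault injection until QPS settles back near the baseline.

    "Settled" means every sample from that point to the end of the series
    lies within +/- tolerance of baseline_qps. Returns None if the series
    ends while QPS is still outside that band.
    """
    low = baseline_qps * (1 - tolerance)
    high = baseline_qps * (1 + tolerance)
    settled_at: Optional[float] = None
    for timestamp, qps in samples:
        if timestamp < fault_time:
            continue  # ignore everything before the fault was injected
        if low <= qps <= high:
            if settled_at is None:
                settled_at = timestamp  # start of a candidate recovery
        else:
            settled_at = None  # QPS left the band again; discard the candidate
    if settled_at is None:
        return None
    return settled_at - fault_time


def needs_investigation(recovery: Optional[float], budget_seconds: float = 60.0) -> bool:
    """True when recovery took longer than the budget or never happened."""
    return recovery is None or recovery > budget_seconds


if __name__ == "__main__":
    # Made-up numbers: baseline of 1000 QPS, pod killed at t=60s.
    quick = [(50, 1000), (60, 400), (70, 600), (80, 850), (90, 980), (100, 1010)]
    took = recovery_seconds(quick, fault_time=60, baseline_qps=1000)
    print(took, needs_investigation(took))
```

In practice the samples would come from whatever monitoring system already tracks QPS, and the tolerance and budget would be tuned to what counts as "acceptable levels" for the service in question.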
Now bear in mind, a crashed pod is just one example of the fault injection that Chaos Mesh is capable of. There are six chaos types: PodChaos, NetworkChaos, TimeChaos, StressChaos, IOChaos, and KernelChaos. For networking errors in particular, you have a lot more options than with the other chaos types. There’s network-delay, network-corrupt, network-loss, network-duplication, and network-partition, all of which affect the network as their names suggest. For filesystem chaos, you have just two, I/O delay and I/O error, which simulate file system I/O delays and errors, respectively.
Other major chaos experiments include container-kill, an important one because it lets you kill a specific container in a pod; CPU-burn, which stresses the CPU of a selected pod; and memory-burn, which stresses its memory. TimeChaos is also an important one, especially in a distributed environment where it’s critical to maintain a synchronized clock across all nodes to remain compliant and maintain security. TimeChaos introduces a fault called “clock skew,” a time difference between the clocks on different nodes, to see how your systems respond.
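As a rough illustration of what one of these network faults looks like as an object, the Python sketch below builds a network-delay experiment as a plain dictionary. The field names (`action`, `mode`, `selector`, `delay`, `duration`, `scheduler`) follow the layout of the `chaos-mesh.org/v1alpha1` examples as we understand them for the 1.0 release; treat them as an assumption and check them against the CRDs installed in your cluster, since the schema can change between releases. The function name, the resource name, the namespace, and the `app: web` label are all hypothetical.

```python
from typing import Dict


def network_delay_manifest(
    name: str,
    namespace: str,
    target_labels: Dict[str, str],
    latency: str = "10ms",
    duration: str = "30s",
    cron: str = "@every 60s",
) -> dict:
    """Build a NetworkChaos object that delays traffic for one matching pod."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",  # assumed API group/version
        "kind": "NetworkChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "delay",          # the network-delay fault
            "mode": "one",              # pick a single pod among the matches
            "selector": {"labelSelectors": dict(target_labels)},
            "delay": {"latency": latency},
            "duration": duration,       # how long each injection lasts
            "scheduler": {"cron": cron},  # how often the injection repeats
        },
    }


if __name__ == "__main__":
    # Hypothetical target: pods labelled app=web.
    manifest = network_delay_manifest("delay-demo", "chaos-testing", {"app": "web"})
    print(manifest["kind"], manifest["spec"]["action"], manifest["spec"]["delay"])
```

The other network faults would differ mainly in the `action` value and in the block of action-specific parameters that sits where `delay` does here.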
A CustomResourceDefinition, or CRD, is a powerful Kubernetes feature that extends the Kubernetes API to let you define custom resources. Chaos Mesh uses this ability to define custom chaos objects like the six chaos types we mentioned earlier. Chaos Mesh also leverages existing CRD implementations wherever applicable and allows you to use existing CRDs to build new objects to start chaos experiments with. New objects can be created, or existing ones updated, with a YAML file or through the Kubernetes API.
The YAML method is the preferred one, especially since it lets you run chaos experiments after deployment. To create your own chaos experiment, you write your own YAML config file, and it’s pretty easy to find examples online. Alternatively, CRDs can be manipulated directly through the Kubernetes API. The fact that Chaos Mesh defines chaos objects with CRDs makes it a natural fit for Kubernetes-based applications. It also makes installation a lot easier, since it’s as simple as applying a set of CRDs to a cluster.
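Because a chaos experiment is just another Kubernetes object, its config file can be generated like any other manifest. The Python sketch below builds a pod-kill experiment and serializes it as JSON, which `kubectl apply -f` accepts alongside YAML; we use JSON only so the example needs nothing beyond the standard library. The field layout is our understanding of the `chaos-mesh.org/v1alpha1` PodChaos examples and should be verified against your installed version. The helper names, the `tidb-cluster-demo` namespace, and the component label are hypothetical.

```python
import json
from typing import Dict, List


def pod_kill_manifest(
    name: str,
    namespace: str,
    target_namespaces: List[str],
    target_labels: Dict[str, str],
    cron: str = "@every 1m",
) -> dict:
    """Build a PodChaos object that repeatedly kills one matching pod."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",  # assumed API group/version
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "pod-kill",
            "mode": "one",  # kill a single pod among those selected
            "selector": {
                "namespaces": list(target_namespaces),
                "labelSelectors": dict(target_labels),
            },
            "scheduler": {"cron": cron},  # how often to repeat the kill
        },
    }


def to_kubectl_json(manifest: dict) -> str:
    """Serialize a manifest as JSON text suitable for `kubectl apply -f`."""
    return json.dumps(manifest, indent=2, sort_keys=True)


if __name__ == "__main__":
    manifest = pod_kill_manifest(
        name="pod-kill-demo",
        namespace="chaos-testing",
        target_namespaces=["tidb-cluster-demo"],
        target_labels={"app.kubernetes.io/component": "tikv"},
    )
    # Redirect this output to a file, then: kubectl apply -f pod-kill.json
    print(to_kubectl_json(manifest))
```

Deleting the object with `kubectl delete -f pod-kill.json` would be the corresponding way to stop the experiment, which is part of what makes the declarative approach convenient.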
Chaos Mesh components
At the moment, Chaos Mesh is made up of two core components: Chaos Operator, which performs the orchestration (GA and fully open-sourced), and Chaos Dashboard, which is essentially a web UI for creating, managing, and monitoring chaos experiments (under development). The Chaos Operator is also referred to as the controller-manager; it uses object controllers to manage CRDs and admission-webhook controllers to inject sidecars into pods. The Chaos Operator also features a Chaos Daemon that functions as an agent, running as a pod on each node.
The Chaos Daemon is a daemonset whose pods run with privileged permissions on their respective nodes. The controller-manager relays the required actions to the daemonset, which in turn uses its privileges to manipulate the system and affect the target pods as required. Admission webhooks come into the picture when we’re looking to simulate more serious I/O failures, which are achieved by injecting chaos sidecars into pods during app deployment and hijacking the I/O by intercepting file-system calls. Based on how such a scenario affects the QPS, the necessary changes can be made to avoid outages caused by similar situations.
Chaos Mesh competition
Chaos Mesh is by no means the only option for a chaos engineering platform. Another CNCF-hosted competitor that checks a lot of the same boxes is Litmus, or LitmusChaos. We say this because, like Chaos Mesh, Litmus is open-source and cloud-native, uses CRDs for chaos management, and is built for Kubernetes. Other popular options include the original chaos engineering tool, Chaos Monkey; Gremlin, which offers chaos engineering as a service; Chaos Toolkit; and KubeInvaders.
Breaking good with chaos engineering
While chaos engineering may sound like a new-fangled oxymoron and breaking things in production may sound just plain crazy, breaking things “productively” is quickly becoming the standard for resiliency testing in Kubernetes.