Chaos Engineering: breaking systems to make them stronger

Chaos Engineering is a practice that aims to identify potential issues and vulnerabilities in a system by deliberately introducing controlled failures. The goal is to expose weaknesses before they cause significant damage in a real-world scenario.

Advances in large-scale, distributed software systems are changing the game for software engineering. As an industry, we are quick to adopt practices that increase flexibility of development and velocity of deployment. An urgent question follows on the heels of these benefits: How much confidence can we have in the complex systems that we put into production?

Netflix introduced the concept of Chaos Engineering in 2010, when it was running a complex and constantly evolving infrastructure to support its streaming service.

The company realized that traditional testing and monitoring alone could not ensure the stability and reliability of its systems, which led to the creation of this practice.

In the simplest sense, Chaos Engineering involves introducing controlled failures or stressors into a system and observing how it responds. The idea is to simulate unexpected events that may occur in a production environment, such as server crashes, network outages, and software errors.

By doing this, Chaos Engineering helps to identify potential issues, bottlenecks, and other vulnerabilities in a system, enabling developers and engineers to address them before they grow into more significant problems.

This approach allows organizations to create more resilient systems and build confidence in their ability to withstand unexpected events.

The process of Chaos Engineering typically involves four key steps:

1. Start by defining ‘steady state’ as some measurable output of a system that indicates normal behavior.
2. Hypothesize that this steady state will continue in both the control group and the experimental group.
3. Introduce variables that reflect real-world events like servers that crash, hard drives that malfunction, network connections that are severed, etc.
4. Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.
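
The four steps can be sketched as a small simulation. This is a minimal illustration rather than a real experiment harness: the function names, the choice of request success rate as the steady-state metric, the 5% injected failure rate, and the 1% tolerance are all assumptions made for the example.

```python
import random


def success_rate(rng, n, failure_rate, retries):
    """Fraction of n simulated requests that succeed."""
    ok = 0
    for _ in range(n):
        # A request succeeds if any of its attempts avoids the injected fault.
        if any(rng.random() >= failure_rate for _ in range(retries + 1)):
            ok += 1
    return ok / n


def run_experiment(retries, seed=0, requests=10_000,
                   injected_failure_rate=0.05, tolerance=0.01):
    rng = random.Random(seed)

    # Step 1: steady state is the request success rate (an assumed metric).
    # Step 2: hypothesis: both groups stay within `tolerance` of each other.
    control = success_rate(rng, requests, failure_rate=0.0, retries=retries)

    # Step 3: introduce a real-world event, here randomly failing attempts.
    experimental = success_rate(
        rng, requests, failure_rate=injected_failure_rate, retries=retries
    )

    # Step 4: try to disprove the hypothesis by comparing the two groups.
    deviation = abs(control - experimental)
    return {
        "control": control,
        "experimental": experimental,
        "hypothesis_holds": deviation <= tolerance,
    }


if __name__ == "__main__":
    for retries in (0, 2):
        print(retries, run_experiment(retries))
```

With `retries=0` the injected failures show up directly in the success rate and the hypothesis is disproved. With `retries=2` the simulated system absorbs them and the steady state holds.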

Chaos Monkey

To automate the testing process, Netflix developed a tool called Chaos Monkey, “a resiliency tool that helps applications tolerate random instance failures.”

Chaos Monkey randomly terminates virtual machine instances and containers that run inside of your production environment. Exposing engineers to failures more frequently incentivizes them to build resilient services.
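
The idea of random instance termination can be illustrated with a toy model. The sketch below is not Chaos Monkey's code or API. `Instance`, `Service`, and `terminate_random_instance` are hypothetical names, and the "service" is simply considered healthy while at least one instance is running.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    running: bool = True


@dataclass
class Service:
    instances: list

    def handle_request(self):
        """Succeed if at least one instance is still running."""
        return any(i.running for i in self.instances)


def terminate_random_instance(service, rng):
    """Stop one randomly chosen running instance and return it (or None)."""
    candidates = [i for i in service.instances if i.running]
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.running = False
    return victim


if __name__ == "__main__":
    rng = random.Random(1)
    single = Service([Instance("only")])
    redundant = Service([Instance(f"i{n}") for n in range(3)])
    for service in (single, redundant):
        terminate_random_instance(service, rng)
        print(len(service.instances), "instance(s):", service.handle_request())
```

A service with a single instance stops answering after one termination, while the three-instance service keeps serving requests. Random termination is designed to bring exactly this kind of difference to the surface.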

kube-monkey

Of course, the same concepts behind Chaos Monkey can also be applied to different architectures. Developer Ayush Sobti has published kube-monkey, a similar tool that operates on Kubernetes clusters:

kube-monkey is an implementation of Netflix’s Chaos Monkey for Kubernetes clusters. It randomly deletes Kubernetes (k8s) pods in the cluster encouraging and validating the development of failure-resilient services.
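
Random pod deletion only validates resilience if something restores the deleted pods. The following sketch models that interplay in plain Python, without talking to a real cluster or using kube-monkey's actual configuration. The `Deployment` class, the `chaos_enabled` flag, and `delete_random_pod` are hypothetical stand-ins invented for the example.

```python
import random


class Deployment:
    """Toy stand-in for a workload that keeps a fixed number of pods."""

    def __init__(self, name, replicas, chaos_enabled=False):
        self.name = name
        self.replicas = replicas
        self.chaos_enabled = chaos_enabled  # hypothetical opt-in flag
        self._next_id = 0
        self.pods = []
        self.reconcile()

    def reconcile(self):
        """Create replacement pods until the desired replica count is met."""
        while len(self.pods) < self.replicas:
            self.pods.append(f"{self.name}-{self._next_id}")
            self._next_id += 1


def delete_random_pod(deployments, rng):
    """Delete one pod from a randomly chosen opted-in deployment."""
    eligible = [d for d in deployments if d.chaos_enabled and d.pods]
    if not eligible:
        return None
    target = rng.choice(eligible)
    victim = rng.choice(target.pods)
    target.pods.remove(victim)
    return victim


if __name__ == "__main__":
    rng = random.Random(0)
    web = Deployment("web", replicas=3, chaos_enabled=True)
    db = Deployment("db", replicas=1)  # not opted in, never touched
    print("deleted:", delete_random_pod([web, db], rng))
    web.reconcile()
    print("pods after reconcile:", web.pods)
```

In this model only the opted-in deployment loses a pod, and the reconcile step brings it back to three replicas under a new pod name.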

But does it really work?

By identifying potential issues and vulnerabilities in a system before they cause significant problems, organizations can reduce downtime, improve system performance, and increase customer satisfaction. Additionally, Chaos Engineering can help organizations to create a culture of resilience and continuous improvement, leading to more stable and reliable systems in the long run.

However, Chaos Engineering is not a one-size-fits-all solution. The process requires significant planning and execution, and it may not be suitable for all types of systems. It also carries risks of its own: an experiment that is not executed correctly can cause data loss or system downtime.

