The DevOps Dojo is an educational podcast focused on DevOps and making the world of building software a little better. Each episode covers a principle, practice or common DevOps fable. Join the Dojo to expand your software development horizons!
All content for The DevOps Dojo is the property of Johan Abildskov and is served directly from their servers
with no modification, redirects, or rehosting. The podcast is not affiliated with or endorsed by Podjoint in any way.
The DevOps Dojo is an educational podcast focused on DevOps and making the world of building software a little better. Each episode covers a principle, practice or common DevOps fable. Join the Dojo to expand your software development horizons!
Chaos Engineering is a set of techniques to build a more proactive and resilient production environment and organization.
In this Episode of the DevOps Dojo, I introduce you to the basics of Chaos Engineering.
Sources
https://principlesofchaos.org/
https://github.com/Netflix/chaosmonkey
https://chaostoolkit.org/
Transcript
Chaos Engineering. The art and practice of randomly introducing failures into our production environment. It seems so counterintuitive that we intentionally break our infrastructure, but Chaos Engineering is a structured and responsible way of approaching exactly this.
Modern distributed systems are so complex that it is difficult to say anything about their behaviour in normal circumstances. This means it is nigh impossible to predict how they will behave when pushed to their limits in a hostile environment. Chaos Engineering allows us to probe our running systems, in order to build confidence in their performance and resilience. I am Johan Abildskov. Join me in the dojo to learn.
The first time I head about Chaos Engineering, was when I learned about the tool Chaos Monkey. Chaos Monkey is a tool created by Netflix that they run in their production, randomly restarting systems and servers. My first reaction was awe. How was this a sane thing to do? It just sounds so wrong. However, I have come to see that this is the only way that you can keep improving your systems in a structured way. Rather than having the unforeseen happen because of user behaviour, you inject tension in the system to discover weaknesses before they cause user-facing problems.
This is the natural continuation of the lean practice of artificially introducing tension in your production lines to continuously optimize productivity.
There are two types of companies, those that are only able to react. Whether it is to the market or to the competition. And there are those companies that are able to disrupt themselves and be proactive in the market.
My money is on the success of the proactive organizations.
But back to Chaos Engineering. To me, the simplest way of explaining Chaos Engineering is that it is applying the scientific method to our production systems. We formulate hypotheses and conduct Chaos Experiments to investigate them.
A chaos experiment has four parts.
First, we defined a steady state as some measurable condition of a system that means it is working as normal. This could be distribution of failure rates or response time on some percentile of the requests. It is key that we have something measurable.
Second, we hypothesize that our system will maintain this steady state under some hostile conditions. This can be server restarts, disk failures or degraded performance in network or in services that we depend on.
Third, we have the execution of the experiment, done with tools like Chaos Monkey or Chaos Toolkit. Here we run the experiment trying to disprove our hypothesis.
Fourth, we collect data and analyse the results. The harder it is to disprove our hypothesis, the more confidence we have in the behaviour of our complex distributed system.
So the four components of a Chaos Experiment are the steady-state, our hypothesis, the execution and the analysis.
But wait, you’ll say. You will tell me I am crazy, that you can’t just to such horrible things to your production environment. And you’ll likely be right. Just as we need to build confidence in our production systems, we need to build confidence in our ability to conduct chaos experiments.
If you are completely new to Chaos Engineering, and perhaps are not even practising disaster recovery, this would be a really good place to start. This helps build our muscle and gets us used to manipulate environments. These trainings are commonly called game days. In some circles, they practice the “wheel of misfortune”, where you randomly select and incident and walk through that.
My best suggestion for a starting disaster recovery training is to test whether you
The DevOps Dojo
The DevOps Dojo is an educational podcast focused on DevOps and making the world of building software a little better. Each episode covers a principle, practice or common DevOps fable. Join the Dojo to expand your software development horizons!