What Is Chaos Engineering & How Do I Get Started?

Chaos Engineering: What it is and what it isn't.


Chaos engineering is not just breaking things for fun, although it is fun. It is the discipline of experimenting on a distributed system in order to build confidence in the system's capability to withstand turbulent conditions in production. It is an approach for building high-availability systems that can tolerate failures and outages. The goal is to design and run experiments that expose failure modes in a distributed system. This process shows you how your system actually behaves when faced with real-world conditions.

What is chaos engineering?

Chaos engineering is a software engineering discipline whose objective is to uncover the weaknesses and faults of a system under stress. It helps you build more resilient systems by exposing your applications to real-world events that can take them down and observing how they respond. It brings order to chaos, enabling engineers to test their assumptions about how well they understand their production environment. Netflix's engineers pioneered the practice, building Chaos Monkey during their move to the cloud, when they realized that conventional testing wasn't sufficient to find every weakness in their system.

How to get started with chaos engineering

You don't need to go all in right away. Chaos engineering can be applied in small increments (which I highly recommend), and getting started doesn't require a large up-front commitment. In fact, you can start with a single experiment before deciding whether to commit time and resources to further experiments.

Start by thinking about what your organization wants to achieve through chaos engineering, and find a project where these practices could help you achieve those goals. For example, if your goal is to improve customer experience during peak hours, when more customers are using your application at once (on Black Friday, say), consider using an open source tool like Netflix's Chaos Monkey, which randomly terminates instances, or a fault injection tool that simulates network errors. Either one lets you see how your application behaves, and what your customers would experience, when the services it depends on become unavailable.
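To make the idea of simulating errors concrete, here is a minimal sketch in Python. It is not how Chaos Monkey or any other tool works internally. It simply wraps a function so that a configurable fraction of calls fail, then checks that the caller degrades gracefully. The names `with_faults`, `fetch_price` and `price_with_fallback` are made up for this illustration.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real dependency error during an experiment."""


def with_faults(func, failure_rate, rng=None):
    """Wrap func so that roughly failure_rate of calls fail before reaching it."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        # The injected fault: fail some calls instead of forwarding them.
        if rng.random() < failure_rate:
            raise InjectedFault("simulated dependency failure")
        return func(*args, **kwargs)

    return wrapper


def fetch_price(item_id):
    # Hypothetical stand-in for a call to a downstream service.
    return {"item_id": item_id, "price_cents": 1999}


def price_with_fallback(fetch, item_id):
    """The behavior under test: degrade gracefully when the dependency fails."""
    try:
        return fetch(item_id)
    except InjectedFault:
        return {"item_id": item_id, "price_cents": None}


if __name__ == "__main__":
    flaky_fetch = with_faults(fetch_price, failure_rate=0.3, rng=random.Random(42))
    results = [price_with_fallback(flaky_fetch, i) for i in range(1000)]
    degraded = sum(1 for r in results if r["price_cents"] is None)
    print(f"{degraded} of {len(results)} calls used the fallback")
```

In a real experiment the fault would be injected at the network or infrastructure level, but the question you ask is the same: does the caller still return something useful when its dependency fails?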

Next, consider which framework will best suit this objective and how much investment is required from teams across the organization. Product Management (PM) will want visibility into what's being tested so they can measure ROI. Engineering teams may need help setting up test environments. Security teams might require additional training before participating in any controlled experiment where a failure could bring down production systems. Operations and Support teams should be involved early so they understand how their role fits into the overall plan, and because they will likely provide feedback on how practices should change once experiments begin surfacing issues in the infrastructure.

What chaos engineering isn't

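Chaos engineering is not testing in the usual sense. Testing is a way to ensure that your software does what you intend it to do, with a known expected result. Chaos engineering is experimentation: you inject a failure and observe how well your system actually handles it, often learning something you did not expect.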

Chaos engineering is not an excuse for recklessness in production. While some chaos experiments can be run safely in production, others should be tried in staging or test environments first.
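One way to keep an experiment from becoming reckless is to put a guard in front of the fault injection. The sketch below is an assumption about how you might do that, not a feature of any particular tool. It only allows injection in environments you have explicitly approved, and only for a small, stable percentage of requests. The function names and the default of `"staging"` are made up for this example.

```python
import hashlib


def in_blast_radius(request_id, percent):
    """Deterministically select roughly `percent` percent of request ids."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Hash the id so the same request is always in or out of the experiment.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


def should_inject(environment, request_id, percent, allowed_environments=("staging",)):
    """Inject only in approved environments and only inside the blast radius."""
    if environment not in allowed_environments:
        return False
    return in_blast_radius(request_id, percent)


if __name__ == "__main__":
    ids = [f"req-{i}" for i in range(10_000)]
    selected = sum(1 for r in ids if should_inject("staging", r, percent=5))
    print(f"{selected} of {len(ids)} staging requests would receive the fault")
    print("production allowed:", should_inject("production", "req-1", percent=5))
```

Hashing the request id, rather than rolling a random number each time, means a given request is consistently inside or outside the experiment, which makes the results easier to reason about.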

Why should I implement chaos engineering?

Chaos engineering is a tool used to improve the reliability of your system. By simulating the failure modes of your applications in a controlled environment, you can identify weaknesses and then fix them. For example, an application that degrades when faced with an unexpected surge of traffic, but recovers quickly without impacting any other parts of the system, is said to be resilient.

If you want to see how resilient your system is under various conditions (e.g., high latency or resource contention), this method can help identify potential bottlenecks and show where there might be room for improvement.
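As a rough illustration of the high latency case, the sketch below adds a fixed delay in front of a function and measures how many calls blow through a latency budget. It runs entirely in one process, so treat it as a toy model of the idea rather than a real network experiment. `with_latency`, `measure`, `slow_call_rate` and `lookup_inventory` are names I made up for this example.

```python
import time


def with_latency(func, delay_s, sleep=time.sleep):
    """Wrap func so every call waits delay_s seconds first."""
    if delay_s < 0:
        raise ValueError("delay_s must not be negative")

    def wrapper(*args, **kwargs):
        sleep(delay_s)  # the injected fault: added latency
        return func(*args, **kwargs)

    return wrapper


def measure(func, calls, clock=time.perf_counter):
    """Call func `calls` times and return the elapsed seconds for each call."""
    durations = []
    for _ in range(calls):
        start = clock()
        func()
        durations.append(clock() - start)
    return durations


def slow_call_rate(durations, budget_s):
    """Fraction of calls that exceeded the latency budget."""
    if not durations:
        raise ValueError("no durations recorded")
    return sum(1 for d in durations if d > budget_s) / len(durations)


def lookup_inventory():
    # Hypothetical stand-in for a dependency call.
    return 42


if __name__ == "__main__":
    budget_s = 0.05
    for delay_s in (0.0, 0.02, 0.1):
        slowed = with_latency(lookup_inventory, delay_s)
        rate = slow_call_rate(measure(slowed, calls=20), budget_s)
        print(f"injected {delay_s * 1000:.0f} ms -> {rate:.0%} of calls over budget")
```

The `sleep` and `clock` parameters exist only so the wrapper can be exercised without real waiting. In a real experiment you would compare the over-budget rate against the threshold your users can tolerate.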

Chaos engineering also helps your systems stay available and responsive under high load or other conditions that would otherwise cause slowdowns or outages in production. The result is better availability when it is needed most, and less downtime overall!

The benefits of chaos engineering

As you can imagine, chaos engineering is a bit of a tough sell. The idea of purposefully bringing down your own system sounds like an awful way to spend your time, but think of it this way: what if you could bring down the system quickly and easily by just pushing a button? In doing so, you would be able to:

  • Test and validate your systems in a controlled manner.

  • Reveal potential weaknesses before they cause outages, and mitigate risk as early as possible.

Chaos engineering is performed by simulating failures in production or production-like environments and monitoring how the application responds under these conditions.

How to implement chaos engineering in your organization

Chaos engineering is a great way to build confidence in your systems, but it's not always easy to know where to start. Follow these steps to get started:

  • Start with a small hypothesis and build from there.

  • Define your goals, scope, metrics and success criteria.

  • Determine what failure scenarios you're most concerned about. Simulate them first against a small subset of traffic or data before running experiments on critical production systems. This helps establish how many resources and which skill sets are needed, as well as how resilient the system being tested actually is.

  • Decide on a budget for the project(s). If possible and appropriate for your organization, consider using existing cloud services such as Amazon EC2 or Google Compute Engine to host test environments. This can reduce overhead costs and give teams more time to focus on their core work rather than on building and maintaining test infrastructure.
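The steps above can be sketched as a tiny experiment runner. This is only an illustration under my own assumptions, not the API of any chaos engineering framework. An experiment carries a hypothesis, a steady-state check (your success criterion), a way to inject the failure and a way to roll it back. The `Experiment` class and the in-memory pool of replicas are both made up for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    hypothesis: str
    steady_state: Callable[[], bool]  # returns True while the system is healthy
    inject: Callable[[], None]        # introduces the failure
    rollback: Callable[[], None]      # removes the failure again


def run(experiment):
    """Run one experiment and report whether the hypothesis held."""
    if not experiment.steady_state():
        return "skipped: system not in steady state before injection"
    experiment.inject()
    try:
        held = experiment.steady_state()
    finally:
        experiment.rollback()  # always remove the fault, even on errors
    return "hypothesis held" if held else "weakness found"


def make_replica_experiment(replicas):
    """Build an experiment against a toy in-memory pool of replicas."""
    removed = []

    def steady_state():
        # Success criterion: at least one replica can serve traffic.
        return len(replicas) >= 1

    def inject():
        if replicas:
            removed.append(replicas.pop())

    def rollback():
        while removed:
            replicas.append(removed.pop())

    return Experiment(
        hypothesis="The service stays up when one replica is lost",
        steady_state=steady_state,
        inject=inject,
        rollback=rollback,
    )


if __name__ == "__main__":
    for pool in (["a", "b"], ["a"]):
        print(len(pool), "replica(s):", run(make_replica_experiment(pool)))
```

Checking the steady state before injecting anything matters: if the system is already unhealthy, the experiment tells you nothing and may make things worse.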


So there you have it. Chaos engineering is a great way to experiment with the edge of your system and learn about how it will respond under different conditions. It's also not just for large organizations like Netflix or Google - even small teams can benefit from chaos engineering as long as they're willing to try something new!
