At A Glance – Chaos Engineering

A software development technique designed to build resistance to failure

Chaos engineering is a technique used to build resilience in software systems. Resilience is the ability of a piece of software to continue to function even when failures occur. It has become increasingly important with the adoption of the large-scale, distributed, often cloud-based systems that characterise digitalisation, and our digital services, today. When building new systems, software developers need to make sure that they can be deployed quickly and easily, without compromising on the quality of service they deliver. Chaos engineering helps here by exposing weaknesses before they turn into outages.

The concept of chaos engineering arises from an assumption that things will go wrong, at some point, in large-scale, distributed software systems. It therefore purposefully introduces failures into the system, in a controlled way, to see whether users are affected. The ‘chaos’ is deliberate randomness: failures are injected unpredictably, reflecting the variety and unpredictability of real-world events. In simple terms: in an environment where anything could go wrong at any time, chaos engineering helps engineers to find weaknesses and mitigate the risk of a breakdown.
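As a rough illustration of introducing failures and checking whether users are affected, here is a minimal Python sketch. It is not taken from any real chaos engineering tool. The names `flaky_dependency`, `handle_request` and `user_error_rate`, the retry count and the cached fallback are all assumptions made for the example.

```python
import random


def flaky_dependency(rng, failure_rate):
    """Hypothetical downstream call that fails with the injected probability."""
    if rng.random() < failure_rate:
        raise ConnectionError("injected failure")
    return "fresh data"


def handle_request(rng, failure_rate, resilient):
    """Serve one user request, with or without resilience measures."""
    if not resilient:
        # No protection: an injected failure reaches the user as an error.
        return flaky_dependency(rng, failure_rate)
    # Assumed resilience measures: up to three attempts, then a cached fallback.
    for _ in range(3):
        try:
            return flaky_dependency(rng, failure_rate)
        except ConnectionError:
            continue
    return "cached data"  # degraded, but still a valid response


def user_error_rate(failure_rate, resilient, requests=1000, seed=0):
    """Run the experiment and return the fraction of requests that errored."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    errors = 0
    for _ in range(requests):
        try:
            handle_request(rng, failure_rate, resilient)
        except ConnectionError:
            errors += 1
    return errors / requests


if __name__ == "__main__":
    print("unprotected:", user_error_rate(0.3, resilient=False))
    print("resilient:  ", user_error_rate(0.3, resilient=True))
```

In this toy setup the same injected failure rate produces user-visible errors for the unprotected handler and none for the resilient one. That is the kind of comparison a chaos experiment is meant to make visible.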

Netflix pioneered the use of chaos engineering during its migration to the cloud in 2011. It created Chaos Monkey, a program that randomly disables servers (virtual machine instances) in its production environment, to make resilience a priority in the minds of developers. If developers know that a failure is definitely going to happen, they are more likely to build resilience into the system, creating a much more reliable service for users. Netflix has since come up with an entire ‘Simian Army’ of destructive tools to test the reliability of its infrastructure. Testing in this way helps to ensure that customers always receive a high quality service.
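The idea of disabling a random server can be sketched in a few lines of Python. This is a toy model only and is not Netflix's Chaos Monkey code. The `Cluster` class, the `chaos_monkey` function and the instance names are hypothetical, and routing is simplified to picking any healthy instance.

```python
import random


class Cluster:
    """Toy pool of server instances behind a very simple router (assumed model)."""

    def __init__(self, names):
        self.healthy = set(names)

    def terminate(self, name):
        """Mark an instance as gone; unknown names are ignored."""
        self.healthy.discard(name)

    def serve(self):
        """Route a request to a healthy instance, or fail if none remain."""
        if not self.healthy:
            raise RuntimeError("no healthy instances")
        return sorted(self.healthy)[0]


def chaos_monkey(cluster, rng):
    """Terminate one randomly chosen healthy instance and return its name."""
    victim = rng.choice(sorted(cluster.healthy))
    cluster.terminate(victim)
    return victim


if __name__ == "__main__":
    rng = random.Random()
    cluster = Cluster(["web-1", "web-2", "web-3"])
    victim = chaos_monkey(cluster, rng)
    print("terminated:", victim)
    print("request served by:", cluster.serve())
```

With three instances the request is still served after one is terminated. With a single instance the same experiment makes `serve` raise, which is exactly the kind of weakness the technique is designed to surface.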
